Last updated: Apr 12, 2024
Reading time·5 min
To filter rows in a Pandas DataFrame using a regex:
str.contains()
method to test if a regex matches each value in a
specific column.Series
of boolean values.import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^Al' starting_with = df[df['name'].str.contains(regex)] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)
The code sample selects the rows in the name
column that start with "Al"
.
The str.contains() method tests if a regex matches each value in the given column.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^Al' # 0 True # 1 True # 2 False # 3 False # 4 False # Name: name, dtype: bool print(df['name'].str.contains(regex))
The last step is to use bracket notation to select the matching rows.
regex = r'^Al' starting_with = df[df['name'].str.contains(regex)]
By default, the str.contains
method is case-sensitive.
If you want to make it case-insensitive, set the case
argument to False
.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^al' starting_with = df[df['name'].str.contains(regex, case=False)] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)
When the case
argument is set to False
, the str.contains
method becomes
case-insensitive.
You can also set the flags
argument when calling the method.
import re import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^al' starting_with = df[df['name'].str.contains( regex, case=False, flags=re.IGNORECASE)] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)
The flags argument represents flags that are passed to the re
module.
If you need to get the rows with the values that don't match the regex, use the
tilde ~
operator.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^Al' starting_with = df[~df['name'].str.contains(regex)] # name sales salary # 2 Bobby 5 190.3 # 3 Tony 7 205.4 # 4 Ethan 7 210.5 print(starting_with)
The code sample uses the tilde ~
operator to invert the boolean values in the
Series
before filtering the DataFrame
based on the regex.
In some cases, you might be able to use a built-in method to filter the rows (instead of a regex).
For example, if you need to select the rows that start with a specific
substring, you can use the str.startswith()
method.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) starting_with = df[df['name'].str.startswith('Al')] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)
There are many built-in methods on the str
attribute that you might be able to
use instead of a regex.
You can also use the str.contains()
method if you need to select the rows that
contain a certain substring using a regex.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'.*on.*' containing = df[df['name'].str.contains(regex)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(containing)
The code sample selects the rows with values that contain the substring "on"
.
If you need to perform the test in a case-insensitive manner, set the case
or
flags
argument, depending on whether you use a regular expression.
import re import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'.*ON.*' containing = df[df['name'].str.contains( regex, case=False, flags=re.IGNORECASE)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(containing)
You could also select the rows that contain a given substring without a regular expression.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) substring = 'on' containing = df[df['name'].str.contains(substring, case=False, regex=False)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(containing)
Notice that we set the regex
argument to False
.
If the regex
argument is set to False
, the supplied pattern (the first
argument) is treated as a literal string.
By default, the pattern argument is treated as a regular expression.
The str.contains()
method can also be used if you need to select the rows that
end with a certain substring using a regex.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'y$' ending_with = df[df['name'].str.contains(regex)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(ending_with)
The code sample selects the rows whose name
values end with y
.
If you need to perform the test in a case-insensitive manner, set the case
or
flags
arguments when calling str.contains()
.
import re import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'Y$' ending_with = df[df['name'].str.contains( regex, case=False, flags=re.IGNORECASE)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(ending_with)
You can also select the rows that end with a certain substring using the
built-in str.endswith()
method instead of using a regex.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) ending_with = df[df['name'].str.endswith('y')] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(ending_with)
I've also written articles on:
You can learn more about the related topics by checking out the following tutorials: