Last updated: Apr 12, 2024
Reading time·5 min

To filter rows in a Pandas DataFrame using a regex:
str.contains() method to test if a regex matches each value in a
specific column.Series of boolean values.import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^Al' starting_with = df[df['name'].str.contains(regex)] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)

The code sample selects the rows in the name column that start with "Al".
The str.contains() method tests if a regex matches each value in the given column.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^Al' # 0 True # 1 True # 2 False # 3 False # 4 False # Name: name, dtype: bool print(df['name'].str.contains(regex))
The last step is to use bracket notation to select the matching rows.
regex = r'^Al' starting_with = df[df['name'].str.contains(regex)]
By default, the str.contains method is case-sensitive.
If you want to make it case-insensitive, set the case argument to False.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^al' starting_with = df[df['name'].str.contains(regex, case=False)] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)

When the case argument is set to False, the str.contains method becomes
case-insensitive.
You can also set the flags argument when calling the method.
import re import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^al' starting_with = df[df['name'].str.contains( regex, case=False, flags=re.IGNORECASE)] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)
The flags argument represents flags that are passed to the re module.
If you need to get the rows with the values that don't match the regex, use the
tilde ~ operator.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'^Al' starting_with = df[~df['name'].str.contains(regex)] # name sales salary # 2 Bobby 5 190.3 # 3 Tony 7 205.4 # 4 Ethan 7 210.5 print(starting_with)

The code sample uses the tilde ~ operator to invert the boolean values in the
Series before filtering the DataFrame based on the regex.
In some cases, you might be able to use a built-in method to filter the rows (instead of a regex).
For example, if you need to select the rows that start with a specific
substring, you can use the str.startswith() method.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) starting_with = df[df['name'].str.startswith('Al')] # name sales salary # 0 Alice 1 175.1 # 1 Alex 3 180.2 print(starting_with)

There are many built-in methods on the str attribute that you might be able to
use instead of a regex.
You can also use the str.contains() method if you need to select the rows that
contain a certain substring using a regex.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'.*on.*' containing = df[df['name'].str.contains(regex)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(containing)

The code sample selects the rows with values that contain the substring "on".
If you need to perform the test in a case-insensitive manner, set the case or
flags argument, depending on whether you use a regular expression.
import re import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'.*ON.*' containing = df[df['name'].str.contains( regex, case=False, flags=re.IGNORECASE)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(containing)
You could also select the rows that contain a given substring without a regular expression.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) substring = 'on' containing = df[df['name'].str.contains(substring, case=False, regex=False)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(containing)

Notice that we set the regex argument to False.
If the regex argument is set to False, the supplied pattern (the first
argument) is treated as a literal string.
By default, the pattern argument is treated as a regular expression.
The str.contains() method can also be used if you need to select the rows that
end with a certain substring using a regex.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'y$' ending_with = df[df['name'].str.contains(regex)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(ending_with)

The code sample selects the rows whose name values end with y.
If you need to perform the test in a case-insensitive manner, set the case or
flags arguments when calling str.contains().
import re import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) regex = r'Y$' ending_with = df[df['name'].str.contains( regex, case=False, flags=re.IGNORECASE)] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(ending_with)

You can also select the rows that end with a certain substring using the
built-in str.endswith() method instead of using a regex.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'], 'sales': [1, 3, 5, 7, 7], 'salary': [175.1, 180.2, 190.3, 205.4, 210.5], }) ending_with = df[df['name'].str.endswith('y')] # name sales salary # 2 Bony 5 190.3 # 3 Tony 7 205.4 print(ending_with)

I've also written articles on:
You can learn more about the related topics by checking out the following tutorials: