Filter rows in a Pandas DataFrame using Regex

# Table of Contents

# Filter rows in a Pandas DataFrame using Regex

To filter rows in a Pandas DataFrame using a regex:

Use the str.contains() method to test if a regex matches each value in a specific column.
Use bracket notation to filter the rows by the Series of boolean values.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'^Al'

starting_with = df[df['name'].str.contains(regex)]

#     name  sales  salary
# 0  Alice      1   175.1
# 1   Alex      3   180.2
print(starting_with)

filter rows in pandas dataframe using regex

The code for this article is available on GitHub

The code sample selects the rows in the name column that start with "Al".

The str.contains() method tests if a regex matches each value in the given column.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'^Al'

# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# Name: name, dtype: bool
print(df['name'].str.contains(regex))

The code for this article is available on GitHub

The last step is to use bracket notation to select the matching rows.

main.py

Copied!
regex = r'^Al'

starting_with = df[df['name'].str.contains(regex)]

By default, the str.contains method is case-sensitive.

If you want to make it case-insensitive, set the case argument to False.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'^al'


starting_with = df[df['name'].str.contains(regex, case=False)]

#     name  sales  salary
# 0  Alice      1   175.1
# 1   Alex      3   180.2
print(starting_with)

filter dataframe rows by regex case insensitive

The code for this article is available on GitHub

When the case argument is set to False, the str.contains method becomes case-insensitive.

You can also set the flags argument when calling the method.

main.py

Copied!
import re
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'^al'


starting_with = df[df['name'].str.contains(
    regex, case=False, flags=re.IGNORECASE)]

#     name  sales  salary
# 0  Alice      1   175.1
# 1   Alex      3   180.2
print(starting_with)

The flags argument represents flags that are passed to the re module.

# Getting the rows with the values that DON'T match the regex

If you need to get the rows with the values that don't match the regex, use the tilde ~ operator.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'^Al'


starting_with = df[~df['name'].str.contains(regex)]

#     name  sales  salary
# 2  Bobby      5   190.3
# 3   Tony      7   205.4
# 4  Ethan      7   210.5
print(starting_with)

getting rows with values not matching regex in dataframe

The code for this article is available on GitHub

The code sample uses the tilde ~ operator to invert the boolean values in the Series before filtering the DataFrame based on the regex.

# Filtering rows with a built-in method

In some cases, you might be able to use a built-in method to filter the rows (instead of a regex).

For example, if you need to select the rows that start with a specific substring, you can use the str.startswith() method.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bobby', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})


starting_with = df[df['name'].str.startswith('Al')]

#     name  sales  salary
# 0  Alice      1   175.1
# 1   Alex      3   180.2
print(starting_with)

using built in method instead of regex when filtering

The code for this article is available on GitHub

There are many built-in methods on the str attribute that you might be able to use instead of a regex.

# Selecting the rows that contain a certain substring using Regex

You can also use the str.contains() method if you need to select the rows that contain a certain substring using a regex.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'.*on.*'

containing = df[df['name'].str.contains(regex)]

#    name  sales  salary
# 2  Bony      5   190.3
# 3  Tony      7   205.4
print(containing)

select rows containing substring using regex

The code for this article is available on GitHub

The code sample selects the rows with values that contain the substring "on".

If you need to perform the test in a case-insensitive manner, set the case or flags argument, depending on whether you use a regular expression.

main.py

Copied!
import re
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'.*ON.*'

containing = df[df['name'].str.contains(
    regex, case=False, flags=re.IGNORECASE)]

#    name  sales  salary
# 2  Bony      5   190.3
# 3  Tony      7   205.4
print(containing)

You could also select the rows that contain a given substring without a regular expression.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Alex', 'Bony', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

substring = 'on'

containing = df[df['name'].str.contains(substring, case=False, regex=False)]

#    name  sales  salary
# 2  Bony      5   190.3
# 3  Tony      7   205.4
print(containing)

select rows containing substring without regex

The code for this article is available on GitHub

Notice that we set the regex argument to False.

If the regex argument is set to False, the supplied pattern (the first argument) is treated as a literal string.

By default, the pattern argument is treated as a regular expression.

# Selecting the rows that end with a certain substring using Regex

The str.contains() method can also be used if you need to select the rows that end with a certain substring using a regex.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'y$'

ending_with = df[df['name'].str.contains(regex)]

#    name  sales  salary
# 2  Bony      5   190.3
# 3  Tony      7   205.4
print(ending_with)

select rows ending with substring using regex

The code for this article is available on GitHub

The code sample selects the rows whose name values end with y.

If you need to perform the test in a case-insensitive manner, set the case or flags arguments when calling str.contains().

main.py

Copied!
import re
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

regex = r'Y$'

ending_with = df[df['name'].str.contains(
    regex, case=False, flags=re.IGNORECASE)]

#    name  sales  salary
# 2  Bony      5   190.3
# 3  Tony      7   205.4
print(ending_with)

filter rows ending with substring using regex ignoring case

You can also select the rows that end with a certain substring using the built-in str.endswith() method instead of using a regex.

main.py

Copied!
import pandas as pd


df = pd.DataFrame({
    'name': ['Alice', 'Yoni', 'Bony', 'Tony', 'Ethan'],
    'sales': [1, 3, 5, 7, 7],
    'salary': [175.1, 180.2, 190.3, 205.4, 210.5],
})

ending_with = df[df['name'].str.endswith('y')]

#    name  sales  salary
# 2  Bony      5   190.3
# 3  Tony      7   205.4
print(ending_with)