Pandas: Get the Rows that are NOT in another DataFrame

avatar
Borislav Hadzhiev

Last updated: Aug 7, 2023
5 min

banner

# Table of Contents

  1. Pandas: Get the Rows that are NOT in another DataFrame
  2. Pandas: Get the Rows that are NOT in another DataFrame using inverted isin() calls

# Pandas: Get the Rows that are NOT in another DataFrame

To get the rows that are NOT in another DataFrame:

  1. Use the DataFrame.merge() method to perform a left join from df1 to df2.
  2. Get the rows that come only from the first DataFrame.
  3. Filter the original DataFrame using the boolean Series.
main.py
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator=True) print(all_rows) rows_not_in = all_rows[all_rows['_merge'] == 'left_only'] print('-' * 50) print(rows_not_in)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
a b _merge 0 2 112 both 1 4 114 both 2 6 115 both 3 8 118 left_only 4 10 120 left_only -------------------------------------------------- a b _merge 3 8 118 left_only 4 10 120 left_only

get rows not in another dataframe

We used the DataFrame.merge method to perform a left join, selecting the rows that are in both DataFrames and those that are contained in the first DataFrame only (df1).

main.py
all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator=True) # a b _merge # 0 2 112 both # 1 4 114 both # 2 6 115 both # 3 8 118 left_only # 4 10 120 left_only print(all_rows)

The code sample selects the rows in df1 that are NOT contained in df2.

Notice that we used the DataFrame.drop_duplicates method to remove the duplicate rows from the DataFrame before calling merge().

This way, we can be sure that each row of the first DataFrame is joined with exactly 1 row of the second DataFrame.

We set the indicator parameter to True to add a _merge column to the output DataFrame.

The _merge column contains information about the DataFrame from which the row was selected.

The _merge column will have a value of "left_only" for rows whose merge key only appears in the left DataFrame.

If the row has a value of "both", then the observation's merge key is found in both DataFrames.

You can set the indicator argument to a string instead of a boolean in case you want to rename the _merge column to something else.

Here is an example that names the column source_df instead of _merge.

main.py
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator='source_df') print(all_rows) rows_not_in = all_rows[all_rows['source_df'] == 'left_only'] print('-' * 50) print(rows_not_in)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
a b source_df 0 2 112 both 1 4 114 both 2 6 115 both 3 8 118 left_only 4 10 120 left_only -------------------------------------------------- a b source_df 3 8 118 left_only 4 10 120 left_only

get rows not in another dataframe renaming merge column

As shown in the output, the _merge column is now named source_df.

If you simply need to remove the _merge column from the resulting DataFrame, use the DataFrame.drop method.

main.py
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator=True) print(all_rows) rows_not_in = all_rows[all_rows['_merge'] == 'left_only'] print('-' * 50) rows_not_in = rows_not_in.drop(['_merge'], axis=1) print(rows_not_in)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
a b _merge 0 2 112 both 1 4 114 both 2 6 115 both 3 8 118 left_only 4 10 120 left_only -------------------------------------------------- a b 3 8 118 4 10 120

drop merge column after getting rows not in dataframe

Make sure to drop the _merge column after you've selected the rows in df1 that are NOT in df2 (and not before).

# Pandas: Get the Rows that are NOT in another DataFrame using inverted isin() calls

You can also use inverted isin() method calls with the tilde ~ operator to find the rows that are not in another DataFrame.

main.py
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2, on=['a', 'b']) print(all_rows) rows_not_in = df1[(~df1['a'].isin(all_rows['a'])) & (~df1['b'].isin(all_rows['b']))] print('-' * 50) print(rows_not_in)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
a b 0 2 112 1 4 114 2 6 115 -------------------------------------------------- a b 3 8 118 4 10 120

get rows not in another dataframe using inverted isin

We first used the merge() method to get the intersection of the columns in both DataFrames.

main.py
all_rows = df1.merge(df2, on=['a', 'b']) # a b # 0 2 112 # 1 4 114 # 2 6 115 print(all_rows)

The next step is to get all rows that are NOT in the other DataFrame by using inverted isin() calls.

main.py
rows_not_in = df1[(~df1['a'].isin(all_rows['a'])) & (~df1['b'].isin(all_rows['b']))] # a b # 3 8 118 # 4 10 120 print(rows_not_in)

The DataFrame.isin() method tests whether each element in the DataFrame is contained in the specified values.

We used the tilde ~ operator to invert the results.

In this case, you can also shorten this a bit by using the DataFrame.dropna() method to remove any missing values.

main.py
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) rows_not_in = df1[~df1.isin(df2)].dropna(how='all') print(rows_not_in)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
a b 3 8.0 118.0 4 10.0 120.0

get rows not in another dataframe using dropna

When the how argument of the dropna() method is set to "all", then if all values are NA, the corresponding row or column is dropped.

# Additional Resources

You can learn more about the related topics by checking out the following tutorials:

I wrote a book in which I share everything I know about how to become a better, more efficient programmer.
book cover
You can use the search field on my Home Page to filter through all of my articles.