Last updated: Aug 7, 2023
Reading time·5 min
To get the rows that are NOT in another DataFrame
:
DataFrame.merge()
method to perform a left join from df1
to
df2
.DataFrame
.DataFrame
using the boolean Series
.import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator=True) print(all_rows) rows_not_in = all_rows[all_rows['_merge'] == 'left_only'] print('-' * 50) print(rows_not_in)
Running the code sample produces the following output.
a b _merge 0 2 112 both 1 4 114 both 2 6 115 both 3 8 118 left_only 4 10 120 left_only -------------------------------------------------- a b _merge 3 8 118 left_only 4 10 120 left_only
We used the
DataFrame.merge
method to perform a left join, selecting the rows that are in both DataFrames
and those that are contained in the first DataFrame
only (df1
).
all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator=True) # a b _merge # 0 2 112 both # 1 4 114 both # 2 6 115 both # 3 8 118 left_only # 4 10 120 left_only print(all_rows)
The code sample selects the rows in df1
that are NOT contained in df2
.
Notice that we used the
DataFrame.drop_duplicates
method to remove the duplicate rows from the DataFrame
before calling
merge()
.
DataFrame
is joined with exactly 1 row of the second DataFrame
.We set the indicator
parameter to True
to add a _merge
column to the
output DataFrame
.
The _merge
column contains information about the DataFrame
from which the
row was selected.
The _merge
column will have a value of "left_only"
for rows whose merge key
only appears in the left DataFrame
.
If the row has a value of "both"
, then the observation's merge key is found in
both DataFrames.
indicator
argument to a string instead of a boolean in case you want to rename the _merge
column to something else.Here is an example that names the column source_df
instead of _merge
.
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator='source_df') print(all_rows) rows_not_in = all_rows[all_rows['source_df'] == 'left_only'] print('-' * 50) print(rows_not_in)
Running the code sample produces the following output.
a b source_df 0 2 112 both 1 4 114 both 2 6 115 both 3 8 118 left_only 4 10 120 left_only -------------------------------------------------- a b source_df 3 8 118 left_only 4 10 120 left_only
As shown in the output, the _merge
column is now named source_df
.
If you simply need to remove the _merge
column from the resulting DataFrame
,
use the DataFrame.drop method.
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2.drop_duplicates(), on=[ 'a', 'b'], how='left', indicator=True) print(all_rows) rows_not_in = all_rows[all_rows['_merge'] == 'left_only'] print('-' * 50) rows_not_in = rows_not_in.drop(['_merge'], axis=1) print(rows_not_in)
Running the code sample produces the following output.
a b _merge 0 2 112 both 1 4 114 both 2 6 115 both 3 8 118 left_only 4 10 120 left_only -------------------------------------------------- a b 3 8 118 4 10 120
Make sure to drop the _merge
column after you've selected the rows in df1
that are NOT in df2
(and not before).
isin()
callsYou can also use inverted isin()
method calls with
the tilde ~
operator
to find the rows that are not in another DataFrame
.
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) all_rows = df1.merge(df2, on=['a', 'b']) print(all_rows) rows_not_in = df1[(~df1['a'].isin(all_rows['a'])) & (~df1['b'].isin(all_rows['b']))] print('-' * 50) print(rows_not_in)
Running the code sample produces the following output.
a b 0 2 112 1 4 114 2 6 115 -------------------------------------------------- a b 3 8 118 4 10 120
We first used the merge()
method to
get the intersection of the columns in both DataFrames.
all_rows = df1.merge(df2, on=['a', 'b']) # a b # 0 2 112 # 1 4 114 # 2 6 115 print(all_rows)
The next step is to get all rows that are NOT in the other DataFrame
by using
inverted isin()
calls.
rows_not_in = df1[(~df1['a'].isin(all_rows['a'])) & (~df1['b'].isin(all_rows['b']))] # a b # 3 8 118 # 4 10 120 print(rows_not_in)
The
DataFrame.isin()
method tests whether each element in the DataFrame
is contained in the
specified values.
We used the tilde ~
operator to invert the results.
In this case, you can also shorten this a bit by using the DataFrame.dropna() method to remove any missing values.
import pandas as pd df1 = pd.DataFrame({ 'a': [2, 4, 6, 8, 10], 'b': [112, 114, 115, 118, 120] }) df2 = pd.DataFrame({ 'a': [2, 4, 6, 9, 11], 'b': [112, 114, 115, 190, 250] }) rows_not_in = df1[~df1.isin(df2)].dropna(how='all') print(rows_not_in)
Running the code sample produces the following output.
a b 3 8.0 118.0 4 10.0 120.0
When the how
argument of the dropna()
method is set to "all"
, then if all
values are NA, the corresponding row or column is dropped.
You can learn more about the related topics by checking out the following tutorials: