Python: Compare two CSV files and print the differences

avatar
Borislav Hadzhiev

Last updated: Apr 13, 2024
5 min

banner

# Table of Contents

  1. Python: Compare two CSV files and print the differences
  2. Using the csv-diff package to compare two CSV files in Python
  3. Using the pandas package to compare two CSV files in Python
  4. Using the pandas package with DataFrame.merge()

# Python: Compare two CSV files and print the differences

To compare two CSV files and print the differences in Python:

  1. Use the with open() statement to open the two CSV files.
  2. Read the lines of each file and store the results in two variables.
  3. Iterate over the lines of the second file and check if each line is not contained in the first file.

Here are the contents of the first CSV file, named csv-file-1.csv.

csv-file-1.csv
Alice,Smith,500 Bob,Smith,600 Carl,Smith,400

And here are the contents of the second CSV file, named csv-file-2.csv.

csv-file-2.csv
Alice,Smith,500 Bob,Smith,600 Dan,Smith,2500

Here is the related Python script, named main.py.

main.py
with open( 'csv-file-1.csv', encoding='utf-8' ) as file_1, open( 'csv-file-2.csv', encoding='utf-8' ) as file_2: file_1_lines = file_1.readlines() file_2_lines = file_2.readlines() with open( 'differences.csv', 'w', encoding='utf-8' ) as differences_file: for line in file_2_lines: if line not in file_1_lines: print(line) differences_file.write(line)

compare two csv files and print differences in python

The code for this article is available on GitHub

We used the with open() statement to open the two CSV files.

The with open() statement takes care of closing the files after we're done (even if an exception has occurred).

The file.readlines() method reads the lines of the file into a Python list.

main.py
with open( 'csv-file-1.csv', encoding='utf-8' ) as file_1, open( 'csv-file-2.csv', encoding='utf-8' ) as file_2: file_1_lines = file_1.readlines() # 👇️ ['Alice,Smith,500\n', 'Bob,Smith,600\n', 'Carl,Smith,400'] print(file_1_lines) file_2_lines = file_2.readlines() # 👇️ ['Alice,Smith,500\n', 'Bob,Smith,600\n', 'Dan,Smith,2500\n'] print(file_2_lines)

Once we have the two lists of lines, we use the with open() statement to open a differences.csv file in w (write) mode.

main.py
with open( 'differences.csv', 'w', encoding='utf-8' ) as differences_file: for line in file_2_lines: if line not in file_1_lines: print(line) differences_file.write(line)
The code for this article is available on GitHub

The for loop iterates over the lines of csv-file-2 and.

On each iteration, we use the not in operator to check if the current line is NOT present in the lines of csv-file-1.

If the condition is met, we print the line and write it to the output CSV file.

You can open the differences.csv file (located in the same directory) to see lines in csv-file-2 that are not present in csv-file-1.

csv file with differences

# Using the csv-diff package to compare two CSV files in Python

You can also use the csv-diff package to compare two CSV files in Python.

First, install the module by running the following command.

shell
pip install csv-diff # or with pip3 pip3 install csv-diff

Here are the contents of the first CSV file, named csv-file-1.csv.

csv-file-1.csv
first,last,salary Alice,Smith,500 Bob,Smith,600 Carl,Smith,400

And here are the contents of the second CSV file, named csv-file-2.csv.

csv-file-2.csv
first,last,salary Alice,Smith,500 Bob,Smith,600 Dan,Smith,2500

Here is the related Python script, named main.py.

main.py
from csv_diff import load_csv, compare diff = compare( load_csv(open('csv-file-1.csv'), 'first'), load_csv(open('csv-file-2.csv'), 'first'), ) # {'added': [{'first': 'Dan', 'last': 'Smith', 'salary': '2500'}], 'removed': [{'first': 'Carl', 'last': 'Smith', 'salary': '400'}], 'changed': [], 'columns_added': [], 'columns_removed': []} print(diff) print('-' * 50) # [{'first': 'Dan', 'last': 'Smith', 'salary': '2500'}] print(diff['added']) print('-' * 50) # [{'first': 'Carl', 'last': 'Smith', 'salary': '400'}] print(diff['removed'])

compare two csv files and print differences using csv diff

The code for this article is available on GitHub

You can also use the module via the command line.

The key parameter is the name of the column by which you want to compare.

shell
csv-diff csv-file-1.csv csv-file-2.csv --key=first

comapre two csv files using command line in python

You can also use the --json option if you want to get the output as JSON.

shell
csv-diff csv-file-1.csv csv-file-2.csv --key=first --json

get differences output as json

# Using the pandas package to compare two CSV files in Python

You can also use the pandas package to compare two CSV files in Python.

First, make sure you have the pandas module installed.

shell
pip install pandas # or with pip3 pip3 install pandas

Here are the contents of the first CSV file, named csv-file-1.csv.

csv-file-1.csv
first,last,salary Alice,Smith,500 Bob,Smith,600 Carl,Smith,400

And here are the contents of the second CSV file, named csv-file-2.csv.

csv-file-2.csv
first,last,salary Alice,Smith,500 Bob,Smith,600 Dan,Smith,2500

Here is the related Python script, named main.py.

main.py
import pandas as pd df1 = pd.read_csv('csv-file-1.csv') print(df1) print('-' * 50) df2 = pd.read_csv('csv-file-2.csv') print(df2) print('-' * 50) df3 = df1.compare(df2) print(df3)

Running the code sample produces the following output.

shell
first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Carl Smith 400 -------------------------------------------------- first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Dan Smith 2500 -------------------------------------------------- first salary self other self other 2 Carl Dan 400.0 2500.0

compare two csv files using pandas

If your CSV files don't start with the column names, you can supply them when calling pandas.read_csv.

Here are the updated contents of the first CSV file, named csv-file-1.csv.

csv-file-1.csv
Alice,Smith,500 Bob,Smith,600 Carl,Smith,400

And here are the updated contents of the second CSV file, named csv-file-2.csv.

csv-file-2.csv
Alice,Smith,500 Bob,Smith,600 Dan,Smith,2500

Here is the related Python code.

main.py
import pandas as pd df1 = pd.read_csv('csv-file-1.csv', names=['first', 'last', 'salary']) print(df1) print('-' * 50) df2 = pd.read_csv('csv-file-2.csv', names=['first', 'last', 'salary']) print(df2) print('-' * 50) df3 = df1.compare(df2) print(df3)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Carl Smith 400 -------------------------------------------------- first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Dan Smith 2500 -------------------------------------------------- first salary self other self other 2 Carl Dan 400.0 2500.0

if csv files dont start with column names

The code sample uses the pandas.read_csv() method to read the CSV files into DataFrame objects and the DataFrame.compare() method to compare the two DataFrames.

The DataFrame.compare() method returns a DataFrame that shows the differences stacked side by side.

The resulting DataFrame has a MultiIndex with self and other stacked alternately at the inner level.

# Using the pandas package with DataFrame.merge()

You can also use the pandas package with DataFrame.merge() to compare two CSV files.

Here are the contents of the first CSV file, named csv-file-1.csv.

csv-file-1.csv
Alice,Smith,500 Bob,Smith,600 Carl,Smith,400

And here are the contents of the second CSV file, named csv-file-2.csv.

csv-file-2.csv
first,last,salary Alice,Smith,500 Bob,Smith,600 Dan,Smith,2500

Here is the related Python script, named main.py.

main.py
import pandas as pd df1 = pd.read_csv('csv-file-1.csv') print(df1) print('-' * 50) df2 = pd.read_csv('csv-file-2.csv') print(df2) print('-' * 50) df3 = df1.merge(df2, indicator=True, how='outer') print(df3)

Running the code sample produces the following output.

shell
first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Carl Smith 400 -------------------------------------------------- first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Dan Smith 2500 -------------------------------------------------- first last salary _merge 0 Alice Smith 500 both 1 Bob Smith 600 both 2 Carl Smith 400 left_only 3 Dan Smith 2500 right_only

The DataFrame.merge() method merges the DataFrame objects with a database-style join.

The second to last row in the resulting DataFrame is only contained in df1 (left_only).

The last row is only contained in df2 (right_only).

You can only get the left_only and right_only rows with a filter.

main.py
import pandas as pd df1 = pd.read_csv('csv-file-1.csv') print(df1) print('-' * 50) df2 = pd.read_csv('csv-file-2.csv') print(df2) print('-' * 50) df3 = df1.merge(df2, indicator=True, how='outer') df3 = df3[df3['_merge'] != 'both'] print(df3)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Carl Smith 400 -------------------------------------------------- first last salary 0 Alice Smith 500 1 Bob Smith 600 2 Dan Smith 2500 -------------------------------------------------- first last salary _merge 2 Carl Smith 400 left_only 3 Dan Smith 2500 right_only

only get the differences

# Additional Resources

You can learn more about the related topics by checking out the following tutorials:

I wrote a book in which I share everything I know about how to become a better, more efficient programmer.
book cover
You can use the search field on my Home Page to filter through all of my articles.