Pandas: Select distinct across multiple DataFrame columns

avatar
Borislav Hadzhiev

Last updated: Apr 12, 2024
4 min

banner

# Table of Contents

  1. Pandas: Select distinct across multiple DataFrame columns
  2. Select distinct across multiple DataFrame columns using a for loop
  3. Getting the unique values across multiple columns in a single array

# Pandas: Select distinct across multiple DataFrame columns

Use the drop_duplicates() method to "select distinct" across multiple DataFrame columns in Pandas.

The method will return a new DataFrame object with the duplicate rows removed.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 25, 40, 45, 45, 65] }) # Animal Max Speed # 0 Cat 25 # 2 Cat 40 # 3 Dog 45 # 5 Dog 65 print(df.drop_duplicates())

select distinct across multiple dataframe columns in pandas

The code for this article is available on GitHub

The code sample shows how to get the unique values across multiple DataFrame columns.

Note: if you need to get the unique values across multiple columns contained in a single array, use the pandas.unique() method.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Animal2': ['Cat', 'Dog', 'Lizzard', 'Monkey', 'Lizzard', 'Cat'] }) # ๐Ÿ‘‡๏ธ ['Cat' 'Dog' 'Lizzard' 'Monkey'] print(pd.unique(df[['Animal', 'Animal2']].values.ravel('K')))

The ravel() method returns a multi-dimensional array which we then flatten.

The DataFrame.drop_duplicates() method returns a new DataFrame object with the duplicate rows removed.

By default, the method doesn't mutate the DataFrame in place, so make sure to assign the result of calling it into a variable.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 25, 40, 45, 45, 65] }) print(df) print('-' * 50) df = df.drop_duplicates() print(df)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
Animal Max Speed 0 Cat 25 1 Cat 25 2 Cat 40 3 Dog 45 4 Dog 45 5 Dog 65 -------------------------------------------------- Animal Max Speed 0 Cat 25 2 Cat 40 3 Dog 45 5 Dog 65

get unique values across multiple dataframe columns

Alternatively, you can set the inplace argument to True to mutate the original DataFrame object.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 25, 40, 45, 45, 65] }) print(df) print('-' * 50) df.drop_duplicates(inplace=True) print(df)

Running the code sample produces the following output.

shell
Animal Max Speed 0 Cat 25 1 Cat 25 2 Cat 40 3 Dog 45 4 Dog 45 5 Dog 65 -------------------------------------------------- Animal Max Speed 0 Cat 25 2 Cat 40 3 Dog 45 5 Dog 65

When the inplace argument is set to True, the DataFrame is modified in place and None is returned.

By default, the DataFrame.drop_duplicates method considers all of the columns when identifying duplicates.

If you only want to consider some of the columns, supply the subset argument.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 25, 40, 45, 45, 65] }) print(df) print('-' * 50) df.drop_duplicates(subset=['Animal'], inplace=True) print(df)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
Animal Max Speed 0 Cat 25 1 Cat 25 2 Cat 40 3 Dog 45 4 Dog 45 5 Dog 65 -------------------------------------------------- Animal Max Speed 0 Cat 25 3 Dog 45

We only passed the Animal column to the subset list, so the method will only consider the specified column for identifying duplicates.

# Select distinct across multiple DataFrame columns using a for loop

You can also use a for loop with pandas.unique() to select the distinct values across multiple DataFrame columns.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 25, 40, 45, 45, 65] }) a_dict = {} for column in df: a_dict[column] = df[column].unique() # {'Animal': array(['Cat', 'Dog'], dtype=object), # 'Max Speed': array([25, 40, 45, 65])} print(a_dict)
The code for this article is available on GitHub

We used a for loop to iterate over the DataFrame.

On each iteration, we use the pandas.unique() method to get an array that stores the unique values for the current column.

You should also use the unique() method if you need to find the unique values in a specific column.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Max Speed': [25, 25, 40, 45, 45, 65] }) print(df['Animal'].unique()) # ['Cat' 'Dog'] print(df['Max Speed'].unique()) # [25 40 45 65]

Note: if you need to get the unique values across multiple columns contained in a single array, use the pandas.unique() method.

main.py
import pandas as pd df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Animal2': ['Cat', 'Dog', 'Lizzard', 'Monkey', 'Lizzard', 'Cat'] }) # ๐Ÿ‘‡๏ธ ['Cat' 'Dog' 'Lizzard' 'Monkey'] print(pd.unique(df[['Animal', 'Animal2']].values.ravel('K')))

# Getting the unique values across multiple columns in a single array

You can also use the numpy.unique() method if you need to get the unique values across multiple columns in a single array.

First, make sure you have the numpy module installed.

Open your terminal and run the following command.

shell
pip install numpy # or with pip3 pip3 install numpy

Now, import the module and use numpy.unique().

main.py
import pandas as pd import numpy as np df = pd.DataFrame({ 'Animal': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'], 'Animal2': ['Cat', 'Dog', 'Lizzard', 'Monkey', 'Lizzard', 'Cat'] }) # ๐Ÿ‘‡๏ธ ['Cat' 'Dog' 'Lizzard' 'Monkey'] print(np.unique(df[['Animal', 'Animal2']].values))

get unique values across multiple dataframe columns in single array

The code for this article is available on GitHub

We used bracket notation to select the Animal and Animal2 columns and passed the resulting DataFrame to numpy.unique().

The numpy.unique method takes an array-like object and returns the unique elements of the array.

# Additional Resources

You can learn more about the related topics by checking out the following tutorials:

I wrote a book in which I share everything I know about how to become a better, more efficient programmer.
book cover
You can use the search field on my Home Page to filter through all of my articles.

Copyright ยฉ 2024 Borislav Hadzhiev