ValueError: cannot reindex on an axis with duplicate labels

Borislav Hadzhiev

Last updated: Apr 11, 2024


# Table of Contents

  1. ValueError: cannot reindex on an axis with duplicate labels
  2. Resetting the index to solve the error
  3. Solving the error when concatenating DataFrames
  4. Removing duplicate columns to solve the error
  5. Solving the error when adding a column to a DataFrame

# ValueError: cannot reindex on an axis with duplicate labels

The Pandas "ValueError: cannot reindex on an axis with duplicate labels" occurs when you reindex, join, or assign to a column of a DataFrame whose index (row labels) or column names contain duplicate values.

To solve the error, remove the rows with duplicate indexes.

Here is an example of how the error occurs when calling DataFrame.reindex():

main.py

    import pandas as pd

    df = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 1, 2])  # 👈️ non-unique indexes

    # ⛔️ ValueError: cannot reindex on an axis with duplicate labels
    df.reindex([0, 1, 2, 3])


Notice that the original index contains duplicate values.

You can print the duplicates with df[df.index.duplicated()].

main.py

    import pandas as pd

    df = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 1, 2])  # 👈️ non-unique indexes

    #     id  name  salary
    # 1  114  Carl    3500
    print(df[df.index.duplicated()])

The code for this article is available on GitHub
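
Before removing anything, it can help to confirm that the index really contains repeats and to see every row involved. The snippet below is a minimal sketch of that check; the variable names `has_unique_index` and `all_duplicates` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [112, 113, 114, 115],
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
}, index=[0, 1, 1, 2])

# False when at least one index label is repeated
has_unique_index = df.index.is_unique

# keep=False marks every occurrence of a repeated label,
# not only the second and later ones
all_duplicates = df[df.index.duplicated(keep=False)]

print(has_unique_index)
print(all_duplicates)
```

With `keep=False`, both the Bobby and Carl rows are selected, which makes it easier to decide which one to keep.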

Use the duplicated() method to remove the rows with duplicate indexes before reindexing.

main.py

    import pandas as pd

    df = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 1, 2])  # 👈️ non-unique indexes

    #     id  name  salary
    # 1  114  Carl    3500
    print(df[df.index.duplicated()])

    df = df[~df.index.duplicated()]

    # Empty DataFrame
    # Columns: [id, name, salary]
    # Index: []
    print(df[df.index.duplicated()])

The pandas.Index.duplicated() method returns a boolean array that marks duplicate index values. By default, the first occurrence of a label is not marked as a duplicate.

Note that the error is also raised if you have duplicate row or column names.

After removing the rows with duplicate indexes, filtering with df.index.duplicated() returns an empty DataFrame.

You can now safely call the reindex() method.

main.py

    import pandas as pd

    df = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 1, 2])  # 👈️ non-unique indexes

    df = df[~df.index.duplicated()]

    # reindex() returns a new DataFrame, so assign the result to use it
    reindexed = df.reindex([0, 1, 2, 3])

    #     id   name  salary
    # 0  112  Alice    1500
    # 1  113  Bobby    2500
    # 2  115    Dan    4500
    print(df)


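Notice that the DataFrame no longer contains the second row with index 1 (Carl). By default, duplicated() keeps the first occurrence of each label, so the Bobby row is retained.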

The DataFrame.reindex() method conforms the DataFrame to the new index and returns a new DataFrame. The original is left unchanged.
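
As a small illustration of what reindex() does once the index is unique, the sketch below starts from an already deduplicated DataFrame. The name `reindexed` is only an illustrative choice.

```python
import pandas as pd

# A DataFrame whose index is already unique
df = pd.DataFrame({
    'id': [112, 113, 115],
    'name': ['Alice', 'Bobby', 'Dan'],
}, index=[0, 1, 2])

# reindex() leaves df untouched and returns a new DataFrame
reindexed = df.reindex([0, 1, 2, 3])

# Label 3 does not exist in df, so the new row holds missing values
print(reindexed)

# df still has three rows, the reindexed copy has four
print(len(df), len(reindexed))
```

Labels that are missing from the original index show up as rows of missing values in the result.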

# Resetting the index to solve the error

You can also solve the error by resetting the index.

main.py

    import pandas as pd

    df = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 1, 2])  # 👈️ non-unique indexes

    df = df.reset_index(drop=True)

    #     id   name  salary
    # 0  112  Alice    1500
    # 1  113  Bobby    2500
    # 2  114   Carl    3500
    # 3  115    Dan    4500
    print(df)


The DataFrame.reset_index() method resets the index of the DataFrame.

The DataFrame now uses the default index.

We also set the drop keyword argument to True so that the old index is discarded instead of being added to the DataFrame as a new column.
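
The difference the drop argument makes can be seen in a short sketch. The variable names `kept` and `dropped` are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
}, index=[0, 1, 1, 2])

# Without drop=True the old labels are preserved in a column named 'index'
kept = df.reset_index()

# With drop=True the old labels are thrown away
dropped = df.reset_index(drop=True)

print(kept)
print(dropped)
```

In both cases the resulting index is the default 0, ..., n - 1 range, so it is unique.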

# Solving the error when concatenating DataFrames

You might also get indexes with duplicate values when you create a DataFrame by concatenating other DataFrames.

If you don't need to preserve the values of your index, and simply want them to be unique, set the ignore_index keyword argument to True when calling pandas.concat.

main.py

    import pandas as pd

    df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[3])
    df2 = pd.DataFrame({'c': 3, 'd': 4}, index=[10])

    df3 = pd.concat([df1, df2], ignore_index=True)

    #      a    b    c    d
    # 0  1.0  2.0  NaN  NaN
    # 1  NaN  NaN  3.0  4.0
    print(df3)


The pandas.concat() function concatenates pandas objects along a particular axis.

When the ignore_index argument is set to True, the index values along the concatenation axis are not used.

The resulting axis is labeled 0, ..., n - 1.

Setting the ignore_index argument to True is useful when concatenating objects where the concatenation axis doesn't have meaningful indexing information.
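
To make the duplicate-label case concrete, here is a minimal sketch with two DataFrames that share the labels 0 and 1. The names `stacked`, `renumbered` and `overlap_detected` are illustrative, and the `verify_integrity` argument is used here as an optional extra check rather than as part of the fix.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'a': [3, 4]}, index=[0, 1])

# Plain concatenation keeps the original labels: 0, 1, 0, 1
stacked = pd.concat([df1, df2])

# ignore_index=True relabels the rows 0, 1, 2, 3
renumbered = pd.concat([df1, df2], ignore_index=True)

# verify_integrity=True makes concat raise a ValueError
# when the resulting index would contain duplicates
try:
    pd.concat([df1, df2], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(stacked.index.is_unique, renumbered.index.is_unique, overlap_detected)
```

Raising early with `verify_integrity=True` can be preferable when the labels carry meaning and silently renumbering them would hide a data problem.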

Alternatively, you can overwrite the current indexes with df.index = new_index.

main.py

    import pandas as pd

    df1 = pd.DataFrame({'a': 1, 'b': 2}, index=[3])
    df2 = pd.DataFrame({'c': 3, 'd': 4}, index=[10])

    # 👇️ overwrite the indexes
    df1.index = [0]
    df2.index = [1]

    df3 = pd.concat([df1, df2])

    #      a    b    c    d
    # 0  1.0  2.0  NaN  NaN
    # 1  NaN  NaN  3.0  4.0
    print(df3)

# Removing duplicate columns to solve the error

The error also commonly occurs if you have a DataFrame where multiple columns have the same name.

You can use the df.loc indexer to remove the duplicate columns.

main.py

    import pandas as pd

    df = pd.DataFrame({
        'id': [112, 113, 114, 115],
        # 'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
        'nums': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 2, 3])

    df.columns = ['id', 'salary', 'salary']

    #     id  salary  salary
    # 0  112    1500    1500
    # 1  113    2500    2500
    # 2  114    3500    3500
    # 3  115    4500    4500
    print(df)

    print('-' * 50)

    # ✅ Remove duplicate column
    df = df.loc[:, ~df.columns.duplicated()]

    #     id  salary
    # 0  112    1500
    # 1  113    2500
    # 2  114    3500
    # 3  115    4500
    print(df)


Notice that the DataFrame has 2 columns named salary.

The following line removes the duplicate columns.

main.py

    # ✅ Remove duplicate column
    df = df.loc[:, ~df.columns.duplicated()]

You can also specify which column you want to keep.

When the keep argument is set to first, the first occurrence is kept.

main.py

    df = df.loc[:, ~df.columns.duplicated(keep='first')]

When the argument is set to last, the last occurrence is kept.

main.py

    df = df.loc[:, ~df.columns.duplicated(keep='last')]

When the argument is set to False, every column whose name occurs more than once is removed, so none of the duplicate columns is kept.

main.py

    df = df.loc[:, ~df.columns.duplicated(keep=False)]
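
The three keep options are easier to compare side by side. The sketch below uses a small DataFrame with different values in the two salary columns so that the effect is visible; the data and the names `first`, `last` and `neither` are invented for this illustration.

```python
import pandas as pd

# Two columns share the name 'salary' but hold different values
df = pd.DataFrame(
    [[1, 10, 20], [2, 30, 40]],
    columns=['id', 'salary', 'salary'],
)

# Keep the first 'salary' column (values 10, 30)
first = df.loc[:, ~df.columns.duplicated(keep='first')]

# Keep the last 'salary' column (values 20, 40)
last = df.loc[:, ~df.columns.duplicated(keep='last')]

# Drop every column whose name is repeated, leaving only 'id'
neither = df.loc[:, ~df.columns.duplicated(keep=False)]

print(first)
print(last)
print(neither)
```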

# Solving the error when adding a column to a DataFrame

You might also get the error when adding a column to a DataFrame if your DataFrame has duplicate indexes.

main.py

    import pandas as pd

    df1 = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 2, 3])

    df2 = pd.DataFrame({
        'age': [25, 26, 27, 28]
    }, index=[1, 1, 2, 3])  # 👈️ has duplicate index

    df1['age'] = df2['age']

    # ⛔️ ValueError: cannot reindex on an axis with duplicate labels
    print(df1)

Notice that df2 has a duplicate index.

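You can assign the underlying NumPy array, accessed with the values property of the column, to solve the error. The values are then assigned by position rather than by index label, so both DataFrames must have the same number of rows.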

main.py

    import pandas as pd

    df1 = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 2, 3])

    df2 = pd.DataFrame({
        'age': [25, 26, 27, 28]
    }, index=[1, 1, 2, 3])  # 👈️ has duplicate index

    # ✅ access .values
    df1['age'] = df2['age'].values

    #     id   name  salary  age
    # 0  112  Alice    1500   25
    # 1  113  Bobby    2500   26
    # 2  114   Carl    3500   27
    # 3  115    Dan    4500   28
    print(df1)


We used the values property to solve the error.

main.py

    # ✅ access .values
    df1['age'] = df2['age'].values

The Series.values property returns a NumPy ndarray containing the values of the column (df2['age'] is a Series).

You can also use the to_numpy() method to solve the error.

main.py

    import pandas as pd

    df1 = pd.DataFrame({
        'id': [112, 113, 114, 115],
        'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
        'salary': [1500, 2500, 3500, 4500],
    }, index=[0, 1, 2, 3])

    df2 = pd.DataFrame({
        'age': [25, 26, 27, 28]
    }, index=[1, 1, 2, 3])  # 👈️ has duplicate index

    # ✅ Using to_numpy() method
    df1['age'] = df2['age'].to_numpy()

    #     id   name  salary  age
    # 0  112  Alice    1500   25
    # 1  113  Bobby    2500   26
    # 2  114   Carl    3500   27
    # 3  115    Dan    4500   28
    print(df1)


The Series.to_numpy() method converts the column to a NumPy array.
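
Assigning a NumPy array copies the values by position and ignores the index labels. If the labels matter, one alternative (shown here as a sketch, not as the article's original method) is to remove the duplicate labels from the source column first and let pandas align on the index. The names `positional`, `aligned` and `unique_ages` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [112, 113, 114, 115]}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({'age': [25, 26, 27, 28]}, index=[1, 1, 2, 3])

# Positional assignment: row i of df1 gets value i of df2['age']
positional = df1.copy()
positional['age'] = df2['age'].to_numpy()

# Label-based assignment: keep the first value per label, then align on the index
unique_ages = df2.loc[~df2.index.duplicated(keep='first'), 'age']
aligned = df1.copy()
aligned['age'] = unique_ages

print(positional)
print(aligned)
```

In the label-aligned result, index 0 has no matching label in df2 and therefore receives a missing value, while the second value stored under label 1 (26) is discarded. Which behavior is appropriate depends on whether the index carries meaning in your data.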

I've written a detailed guide on how to copy a column from one DataFrame to another in Pandas.


Copyright © 2024 Borislav Hadzhiev