Columns have mixed types. Specify dtype option on import

Borislav Hadzhiev

Last updated: Apr 11, 2024

# Table of Contents

  1. Columns have mixed types. Specify dtype option on import
  2. Silencing the warning by setting dtype to object
  3. Silencing the warning by setting low_memory to False
  4. Using the unicode data type instead
  5. Explicitly setting the engine to python
  6. Using converters to resolve the warning

# Columns have mixed types. Specify dtype option on import

The Pandas warning "Columns have mixed types. Specify dtype option on import or set low_memory=False." occurs when a CSV file contains columns whose data types are mixed and cannot be inferred reliably.

To resolve the issue, specify the data type of each column explicitly by supplying the dtype argument.

For example, suppose you have the following employees.csv file.

employees.csv

```
first_name,last_name,date,salary
Alice,Smith,01/21/1995,1500
Bobby,Hadz,04/14/1998,abc
Carl,Lemon,06/11/1994,3000
Dean,Berry,06/11/1996,xyz
```

Notice that the salary column contains integers and strings.

The best way of solving the issue where pandas tries to guess the data type of the column is to explicitly specify it.

main.py

```python
import pandas as pd

dtype = {
    'first_name': str,
    'last_name': str,
    'date': str,
    'salary': str
}

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    dtype=dtype
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)
```


The code for this article is available on GitHub

The dtype argument of the pandas.read_csv() method is a dictionary that maps column names to types.

Here is another example.

main.py

```python
import numpy as np

dtype = {'a': np.float64, 'b': np.int32, 'c': 'Int64'}
```

You can use the str or object type to preserve the values as they are and prevent pandas from interpreting the dtype.

Pandas can only infer the data type of a column once it has read the entire column.

In other words, pandas has to scan every value in the column before it can settle on a dtype, which is inefficient.

For example, if our employees.csv file had 5 million rows and its salary column contained only numbers, pandas couldn't know that the column only contains numbers until it had read all 5 million rows.
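You can see this inference problem in action with a self-contained sketch (the 700,000/300,000 split is arbitrary, chosen so the early chunks of the file look purely numeric):

```python
import io
import warnings

import pandas as pd

# A column whose first rows parse as integers but whose later rows are strings.
csv_data = 'salary\n' + '1500\n' * 700_000 + 'abc\n' * 300_000

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df = pd.read_csv(io.StringIO(csv_data))

# Because the values are mixed, the column falls back to the object dtype,
# and with the default low_memory=True a DtypeWarning may be emitted.
print(df['salary'].dtype)  # object
print([w.category.__name__ for w in caught])
```

The chunks that pandas reads under the default low_memory=True disagree about the dtype, which is exactly what triggers the warning.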

Note that the salary column in the CSV file doesn't contain only integers.

employees.csv

```
first_name,last_name,date,salary
Alice,Smith,01/21/1995,1500
Bobby,Hadz,04/14/1998,abc
Carl,Lemon,06/11/1994,3000
Dean,Berry,06/11/1996,xyz
```

So if you try to set the dtype of the column to int, you'd get an error:

main.py

```python
import pandas as pd

dtype = {
    'first_name': str,
    'last_name': str,
    'date': str,
    'salary': int  # 👈️ set dtype to int
}

# ⛔️ ValueError: invalid literal for int() with base 10: 'abc'
df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    dtype=dtype
)
```

# Silencing the warning by setting dtype to object

If you just want to silence the warning, set the dtype of the column to object.

main.py

```python
import pandas as pd

dtype = {
    'first_name': str,
    'last_name': str,
    'date': str,
    'salary': object
}

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    dtype=dtype
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)
```

Setting the dtype of the column to object silences the warning but doesn't make the read_csv() method more memory efficient.

# Silencing the warning by setting low_memory to False

You can also silence the warning by setting low_memory to False in the call to pandas.read_csv().

main.py

```python
import pandas as pd

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    low_memory=False
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)
```



By default, the low_memory argument is set to True.

When the argument is set to True, the file is processed in chunks, resulting in lower memory consumption while parsing, but possibly mixed type inference.

To ensure no mixed types, you can set the low_memory argument to False or explicitly set the dtype parameter as we did in the previous subheading.
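If memory is the real concern, you can also take over the chunking yourself with the chunksize parameter. Here is a minimal sketch (the in-memory CSV and the chunk size of 2 are just for illustration):

```python
import io

import pandas as pd

csv_data = (
    'first_name,salary\n'
    'Alice,1500\n'
    'Bobby,abc\n'
    'Carl,3000\n'
    'Dean,xyz\n'
)

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file into memory at once.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 4
```

Each chunk is a regular DataFrame, so you can aggregate or filter it and discard it before reading the next one.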

# Using the unicode data type instead

You can also silence the warning by using the unicode data type.

main.py

```python
import pandas as pd
import numpy as np

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    index_col=False,
    dtype=np.dtype('unicode'),
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)
```

Make sure you have the NumPy module installed by running the following command.

shell

```shell
pip install numpy

# or with pip3
pip3 install numpy
```

We set the data type to Unicode strings.

When index_col is set to False, pandas won't use the first column as the index (e.g. when you have a malformed CSV file with delimiters at the end of each line).
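Here is a small sketch of that malformed-file case (the trailing commas are intentional; without index_col=False pandas would treat the first column as the index because the data rows have one more field than the header):

```python
import io

import pandas as pd

# Each data row ends with a stray delimiter, so it has one more field
# than the header declares.
malformed = (
    'first_name,last_name,salary\n'
    'Alice,Smith,1500,\n'
    'Bobby,Hadz,2000,\n'
)

df = pd.read_csv(io.StringIO(malformed), index_col=False)

print(df['first_name'].tolist())  # ['Alice', 'Bobby']
```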

# Explicitly setting the engine to python

Another thing you can try is to set the engine argument to python when calling pandas.read_csv.

main.py

```python
import pandas as pd

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    engine='python'
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)
```

The engine argument selects the parser engine to use.

The available engines are c, pyarrow and python.

The c and pyarrow engines are faster; however, the python engine is more feature-complete.

# Using converters to resolve the warning

You can also use converters to resolve the warning.

Suppose you have the following employees.csv file.

employees.csv

```
first_name,last_name,date,salary
Alice,Smith,01/21/1995,1500
Bobby,Hadz,04/14/1998,abc
Carl,Lemon,06/11/1994,3000
Dean,Berry,06/11/1996,xyz
```

Notice that the salary column contains integers and strings.

Here is how we can use a converter to set the non-numeric salary values to 0.

main.py

```python
import pandas as pd
import numpy as np

def converter1(value):
    if not value:
        return 0

    try:
        return np.float64(value)
    except Exception as _e:
        return np.float64(0)

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    converters={'salary': converter1}
)

#   first_name last_name        date  salary
# 0      Alice     Smith  01/21/1995  1500.0
# 1      Bobby      Hadz  04/14/1998     0.0
# 2       Carl     Lemon  06/11/1994  3000.0
# 3       Dean     Berry  06/11/1996     0.0
print(df)
```

The example only uses a converter for the salary column, however, you can specify as many converters as necessary in the converters dictionary.

The converter1 function gets called with each field from the salary column.

If the supplied value is empty, we return 0.

The try block tries to convert the value to a NumPy float.

If an error is raised, the except block runs and returns 0.
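An alternative to a custom converter is to read the column as strings and clean it afterwards with pandas.to_numeric (a sketch using an in-memory CSV; errors='coerce' turns unparseable values into NaN):

```python
import io

import pandas as pd

csv_data = 'salary\n1500\nabc\n3000\nxyz\n'

df = pd.read_csv(io.StringIO(csv_data), dtype=str)

# Unparseable strings become NaN, which we then replace with 0.
df['salary'] = pd.to_numeric(df['salary'], errors='coerce').fillna(0)

print(df['salary'].tolist())  # [1500.0, 0.0, 3000.0, 0.0]
```

This keeps the cleanup logic out of the read step, which can be easier to test than a converter function.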
