# Columns have mixed types. Specify dtype option on import

The pandas DtypeWarning "Columns have mixed types. Specify dtype option on import or set low_memory=False." is raised when read_csv() cannot reliably infer a single data type for one or more columns of your CSV file.

To resolve the issue, specify the data type of each column explicitly by supplying the dtype argument.

For example, suppose you have the following employees.csv file.

employees.csv

first_name,last_name,date,salary
Alice,Smith,01/21/1995,1500
Bobby,Hadz,04/14/1998,abc
Carl,Lemon,06/11/1994,3000
Dean,Berry,06/11/1996,xyz

Notice that the salary column contains both integers and strings.

Rather than letting pandas guess the data type of each column, the most reliable fix is to specify it explicitly.

main.py

import pandas as pd

dtype = {
    'first_name': str,
    'last_name': str,
    'date': str,
    'salary': str
}

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    dtype=dtype,
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)

The dtype argument of pandas.read_csv() accepts either a single type that is applied to every column or a dictionary that maps column names to types.

Here is another example.

main.py

import numpy as np

dtype={'a': np.float64, 'b': np.int32, 'c': 'Int64'}

Use the str or object type to keep a column's values exactly as they appear in the file, without pandas interpreting them.
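As a minimal sketch of how the dictionary form behaves, the example below reads a small in-memory CSV and prints the resulting dtypes. The a, b and c columns and their values are made up for illustration. The nullable 'Int64' type is used for column c because it contains a missing value, which a plain NumPy integer column cannot hold.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: column `c` has a missing value in the second row
csv_data = io.StringIO("a,b,c\n1.5,2,3\n2.5,4,\n")

dtype = {'a': np.float64, 'b': np.int32, 'c': 'Int64'}

df = pd.read_csv(csv_data, dtype=dtype)

# Each column gets the type requested in the dictionary
print(df.dtypes)
```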

When no dtype is given, pandas has to infer the type of each column from its values, and it can only be sure of the type once the entire column has been read.

In other words, pandas can't settle on a type for the column until all of its values have been seen, which is inefficient for large files.

For example, if our employees.csv file had 5 million rows and a salary column that only contains numbers, pandas couldn't know that the column only contains numbers until it had read all 5 million rows.

Note that the salary column in the CSV file doesn't contain only integers.

employees.csv

first_name,last_name,date,salary
Alice,Smith,01/21/1995,1500
Bobby,Hadz,04/14/1998,abc
Carl,Lemon,06/11/1994,3000
Dean,Berry,06/11/1996,xyz

So if you try to set the dtype of the column to int, you'd get an error:

ValueError: invalid literal for int() with base 10

main.py

import pandas as pd

dtype = {
    'first_name': str,
    'last_name': str,
    'date': str,
    'salary': int # 👈️ set dtype to int
}

# ⛔️ ValueError: invalid literal for int() with base 10: 'abc'
df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    dtype=dtype,
)
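If you need numeric salaries despite the invalid entries, one possible approach (an alternative technique, not the only option) is to read the column as strings and convert it afterwards with pandas.to_numeric(), passing errors='coerce' so that values that can't be parsed become NaN. The sketch below uses an in-memory copy of employees.csv so it can be run as is.

```python
import io

import pandas as pd

# In-memory stand-in for the employees.csv file
csv_data = io.StringIO(
    "first_name,last_name,date,salary\n"
    "Alice,Smith,01/21/1995,1500\n"
    "Bobby,Hadz,04/14/1998,abc\n"
    "Carl,Lemon,06/11/1994,3000\n"
    "Dean,Berry,06/11/1996,xyz\n"
)

# Read every column as a string so nothing fails during parsing
df = pd.read_csv(csv_data, dtype=str)

# Convert afterwards; unparsable values such as 'abc' become NaN
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

print(df['salary'])
```

Whether NaN is the right placeholder depends on your data; you could follow up with fillna() if a default value makes more sense.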

# Silencing the warning by setting `dtype` to `object`

If you just want to silence the warning, set the dtype of the column to object.

main.py

import pandas as pd

dtype = {
    'first_name': str,
    'last_name': str,
    'date': str,
    'salary': object
}

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    dtype=dtype,
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)

Setting the dtype of the column to object silences the warning but doesn't make the read_csv() method more memory efficient.

# Silencing the warning by setting `low_memory` to `False`

You can also silence the warning by setting low_memory to False in the call to pandas.read_csv().

main.py

import pandas as pd

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    low_memory=False,
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)

By default, the low_memory argument is set to True.

When the argument is set to True, the file is processed in chunks, resulting in lower memory consumption while parsing, but possibly mixed type inference.

To ensure no mixed types, you can set the low_memory argument to False or explicitly set the dtype parameter as we did in the previous subheading.
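To get a feel for why chunked processing can lead to mixed types, the sketch below reads a tiny CSV in chunks of two rows with the public chunksize argument and records the dtype inferred for each chunk. This is only an analogy for what low_memory=True does internally, and the data is made up so that the first chunk is purely numeric and the second is not.

```python
import io

import pandas as pd

# Hypothetical data: the first two salaries are numeric, the last two are not
csv_data = io.StringIO("salary\n1500\n3000\nabc\nxyz\n")

chunk_dtypes = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # The dtype is inferred separately for every chunk
    chunk_dtypes.append(str(chunk['salary'].dtype))

# The numeric chunk is inferred as int64, the other one as a string/object type
print(chunk_dtypes)
```

Because each chunk is typed on its own, a column that looks numeric at first can end up holding both numbers and strings.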

# Using the `unicode` data type instead

You can also silence the warning by using the unicode data type.

main.py

import pandas as pd
import numpy as np

df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    index_col=False,
    dtype=np.dtype('unicode'),
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)

Make sure you have the NumPy module installed by running the following command.

shell

pip install numpy

# or with pip3
pip3 install numpy

Passing a single dtype applies it to every column, so all values are read as Unicode strings.

When index_col is set to False, pandas won't use the first column as the index (e.g. when you have a malformed CSV file with delimiters at the end of each line).
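If you'd rather not depend on NumPy for this, a comparable sketch is to pass the built-in str type as the dtype, which also reads every column as text. The in-memory CSV below stands in for employees.csv.

```python
import io

import pandas as pd

# In-memory stand-in for the employees.csv file
csv_data = io.StringIO(
    "first_name,last_name,date,salary\n"
    "Alice,Smith,01/21/1995,1500\n"
    "Bobby,Hadz,04/14/1998,abc\n"
    "Carl,Lemon,06/11/1994,3000\n"
    "Dean,Berry,06/11/1996,xyz\n"
)

# A single dtype is applied to every column
df = pd.read_csv(csv_data, index_col=False, dtype=str)

# Every salary is kept as text, including the numeric-looking ones
print(df['salary'].tolist())
```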

# Explicitly setting the engine to `python`

Another thing you can try is to set the engine argument to python when calling pandas.read_csv.

main.py

import pandas as pd


df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    engine='python'
)

#   first_name last_name        date salary
# 0      Alice     Smith  01/21/1995   1500
# 1      Bobby      Hadz  04/14/1998    abc
# 2       Carl     Lemon  06/11/1994   3000
# 3       Dean     Berry  06/11/1996    xyz
print(df)

The engine argument selects the parser engine that read_csv() uses.

The available engines are c, pyarrow and python.

The c and pyarrow engines are faster, whereas the python engine is currently more feature-complete.

# Using converters to resolve the warning

You can also use converters to resolve the warning.

Suppose you have the following employees.csv file.

employees.csv

first_name,last_name,date,salary
Alice,Smith,01/21/1995,1500
Bobby,Hadz,04/14/1998,abc
Carl,Lemon,06/11/1994,3000
Dean,Berry,06/11/1996,xyz

Notice that the salary column contains integers and strings.

Here is how we can use a converter to initialize the non-numeric salary values to 0.

main.py

import pandas as pd
import numpy as np


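def converter1(value):
    # Empty fields become 0.0 so the column stays numeric
    if not value:
        return np.float64(0)
    try:
        return np.float64(value)
    except ValueError:
        # Non-numeric values such as 'abc' also fall back to 0.0
        return np.float64(0)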


df = pd.read_csv(
    'employees.csv',
    sep=',',
    encoding='utf-8',
    converters={'salary': converter1}
)

#   first_name last_name        date  salary
# 0      Alice     Smith  01/21/1995  1500.0
# 1      Bobby      Hadz  04/14/1998     0.0
# 2       Carl     Lemon  06/11/1994  3000.0
# 3       Dean     Berry  06/11/1996     0.0
print(df)

The example only uses a converter for the salary column, but you can specify as many converters as necessary in the converters dictionary.

The converter1 function gets called with each field from the salary column.

If the supplied value is empty, we return 0.

The try block tries to convert the value to a NumPy float.

If an error is raised, the except block runs and we return 0.
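As a sketch of using more than one converter, the example below adds a second converter that strips whitespace from first_name. The to_float helper and the sample data (including the padded names and the empty salary) are assumptions made for illustration.

```python
import io

import pandas as pd


def to_float(value):
    # Converters receive each field as a string
    try:
        return float(value)
    except ValueError:
        # Empty or non-numeric fields fall back to 0.0
        return 0.0


# Hypothetical data with padded names, one invalid and one empty salary
csv_data = io.StringIO(
    "first_name,salary\n"
    " Alice ,1500\n"
    "Bobby ,abc\n"
    "Carl,\n"
)

df = pd.read_csv(
    csv_data,
    converters={'first_name': str.strip, 'salary': to_float},
)

print(df)
```

Each key in the converters dictionary is a column name and each value is a function that is called once per field in that column.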
