Last updated: Apr 12, 2024
Reading time·4 min
To change the column type to Categorical in Pandas:
astype()
method on the selected column, passing it "category"
as
a parameter.import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', 'Carl', 'Dan'], 'experience': [1, 5, 3, 8], 'salary': [189.1, 180.2, 190.3, 205.4], }) df['name'] = df['name'].astype('category') print(df) print('-' * 50) print(df.dtypes)
Running the code sample produces the following output.
name experience salary 0 Alice 1 189.1 1 Bobby 5 180.2 2 Carl 3 190.3 3 Dan 8 205.4 -------------------------------------------------- name category experience int64 salary float64 dtype: object
We used bracket notation to select the name
column and called the
astype
method on it.
df['name'] = df['name'].astype('category')
The astype()
method casts a pandas object to the supplied data type.
You can access the
dtypes
attribute on the DataFrame
to verify that the column has been converted to
category
.
# name category # experience int64 # salary float64 # dtype: object print(df.dtypes)
The dtypes
attribute returns the data types in the DataFrame
.
To be more precise, a Series
with the data type of each column is returned.
You can use the same approach if you need to change the data type of multiple columns to Categorical.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', 'Carl', 'Dan'], 'experience': [1, 5, 3, 8], 'salary': [189.1, 180.2, 190.3, 205.4], }) columns = ['name', 'experience'] df[columns] = df[columns].astype('category') # name category # experience category # salary float64 # dtype: object print(df.dtypes)
We passed a list of columns between the square brackets []
to change the type
of the name
and experience
columns to category
.
You can also specify the column names inline.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', 'Carl', 'Dan'], 'experience': [1, 5, 3, 8], 'salary': [189.1, 180.2, 190.3, 205.4], }) df[['name', 'experience']] = df[['name', 'experience']].astype('category') # name category # experience category # salary float64 # dtype: object print(df.dtypes)
However, notice that we have 2 sets of square brackets next to one another.
In older versions of Pandas, you used to use a for loop to change the type of multiple columns.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', 'Carl', 'Dan'], 'experience': [1, 5, 3, 8], 'salary': [189.1, 180.2, 190.3, 205.4], }) columns = ['name', 'experience'] for column in columns: df[column] = df[column].astype('category') # name category # experience category # salary float64 # dtype: object print(df.dtypes)
However, in recent Pandas versions, iterating over the columns collection is not necessary.
You might also see examples online that use a lambda function.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', 'Carl', 'Dan'], 'experience': [1, 5, 3, 8], 'salary': [189.1, 180.2, 190.3, 205.4], }) columns = ['name', 'experience'] df[columns] = df[columns].apply(lambda x: x.astype('category')) # name category # experience category # salary float64 # dtype: object print(df.dtypes)
The lambda function gets called with each column name and sets its type to
category
.
However, using a lambda is not needed in recent Pandas versions, as you can directly specify the list of columns between the square brackets.
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', 'Carl', 'Dan'], 'experience': [1, 5, 3, 8], 'salary': [189.1, 180.2, 190.3, 205.4], }) columns = ['name', 'experience'] df[columns] = df[columns].astype('category') # name category # experience category # salary float64 # dtype: object print(df.dtypes)
If you need to convert all columns, except for columns that have a specific data
type to categorical, use the select_dtypes()
method.
import pandas as pd df = pd.DataFrame({ 'id': ['a', 'b', 'c', 'd'], 'name': ['Alice', 'Bobby', 'Carl', 'Dan'], 'experience': [1, 5, 3, 8], 'salary': [189.1, 180.2, 190.3, 205.4], }) columns = df.select_dtypes(exclude='int').columns.to_list() df[columns] = df[columns].astype('category') # id category # name category # experience int64 # salary category # dtype: object print(df.dtypes)
The DataFrame.select_dtypes() method returns a subset of the DataFrame's columns based on the column data types.
We excluded the int
column (experience
) and converted all other columns to
categorical.
I've also written an article on how to get a list of categories or categorical columns.
You can learn more about the related topics by checking out the following tutorials: