Pandas: Get a List of Categories or Categorical Columns

Borislav Hadzhiev

Last updated: Apr 12, 2024


# Table of Contents

  1. Pandas: Get a List of Categories or Categorical Columns
  2. Get a List of the Categories in a Category column in Pandas
  3. Getting the categorical columns in a DataFrame with _get_numeric_data()
  4. Checking if a specific DataFrame column is Categorical

# Pandas: Get a List of Categories or Categorical Columns

To get the categorical columns in a DataFrame:

  1. Call the select_dtypes() method on the DataFrame.
  2. Set the include argument to "category".
  3. The method will return a DataFrame containing only the categorical columns.
main.py
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

print(df.select_dtypes(include=['category']))

print('-' * 50)

print(df['name'].cat.categories)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
  id   name
0  a  Alice
1  b  Bobby
2  c   Carl
3  d    Dan
--------------------------------------------------
Index(['Alice', 'Bobby', 'Carl', 'Dan'], dtype='object')


The DataFrame.select_dtypes method returns a subset of a DataFrame's columns based on the column data types.

To only select the categorical columns, we set the include argument to "category".

main.py
#   id   name
# 0  a  Alice
# 1  b  Bobby
# 2  c   Carl
# 3  d    Dan
print(df.select_dtypes(include=['category']))

The include argument can be set to a selection of dtypes or strings to be included.

You can also specify multiple dtypes in the include list.

main.py
print(df.select_dtypes(include=['category', 'object']))

If the columns you're looking for don't get listed, try adding the object type as shown in the code sample.
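If you need the column names themselves rather than a sub-DataFrame, one option is to read the columns attribute of the result and convert it to a list. The following is a minimal sketch that reuses the sample DataFrame from above; the variable name categorical_column_names is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

# select the categorical columns, then keep only their names
categorical_column_names = df.select_dtypes(
    include=['category']
).columns.tolist()

print(categorical_column_names)  # ['id', 'name']
```

The names come back in the same order as the columns appear in the DataFrame.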

There is also an exclude argument that does the opposite.

main.py
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

#   id   name
# 0  a  Alice
# 1  b  Bobby
# 2  c   Carl
# 3  d    Dan
print(df.select_dtypes(exclude=['number', 'bool_', 'object_']))


# Get a List of the Categories in a Category column in Pandas

If you need to get a list of the categories in a Category column:

  1. Select the category column using bracket notation.
  2. Access the cat.categories attribute on the selected column.
main.py
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

# Index(['a', 'b', 'c', 'd'], dtype='object')
print(df['id'].cat.categories)

print('-' * 50)

# Index(['Alice', 'Bobby', 'Carl', 'Dan'], dtype='object')
print(df['name'].cat.categories)


The cat.categories attribute returns the categories of the given categorical column.

The attribute holds an Index object, so if you want to get the result as a list, call the tolist() method on it.

main.py
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

# ['a', 'b', 'c', 'd']
print(df['id'].cat.categories.tolist())

print('-' * 50)

# ['Alice', 'Bobby', 'Carl', 'Dan']
print(df['name'].cat.categories.tolist())


The Index.tolist() method returns a list of the values in the Index.
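One detail worth keeping in mind is that cat.categories lists every declared category, including ones that never occur in the data. The sketch below illustrates this with a small, made-up Series (the name letters and its values are assumptions for the example, not part of the sample DataFrame).

```python
import pandas as pd

# 'c' is declared as a category but never appears in the data
letters = pd.Series(
    pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
)

# all declared categories
declared = letters.cat.categories.tolist()

# only the values that actually occur in the column
present = letters.unique().tolist()

print(declared)  # ['a', 'b', 'c']
print(present)   # ['a', 'b']
```

If you only care about the values that are actually present in the column, calling unique() may be the better fit.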

# Getting the categorical columns in a DataFrame with _get_numeric_data()

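You can also get the non-numeric columns by subtracting the columns returned by _get_numeric_data() from all of the DataFrame's columns.

This approach treats every non-numeric column as categorical, so it only works if none of your numeric columns hold categorical data and every non-numeric column is categorical. Note that the leading underscore marks _get_numeric_data() as a private pandas method, so its behavior may change between versions.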

main.py
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

numeric_columns = df._get_numeric_data().columns

# 👇️ Index(['experience', 'salary'], dtype='object')
print(numeric_columns)

categorical_columns = list(set(df.columns) - set(numeric_columns))

# sets are unordered, so the order of the names may vary
print(categorical_columns)  # 👉️ ['name', 'id']

We used the _get_numeric_data() method to get all numeric columns in the DataFrame.

The last step is to subtract the numeric columns from all of the DataFrame's columns and convert the result to a list.

We used the set() constructor to convert the index objects to Set objects to be able to use the subtraction (-) operator.

We could've achieved the same result by using the set.difference() method.

main.py
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

numeric_columns = df._get_numeric_data().columns

# 👇️ Index(['experience', 'salary'], dtype='object')
print(numeric_columns)

categorical_columns = list(set(df.columns).difference(numeric_columns))

# sets are unordered, so the order of the names may vary
print(categorical_columns)  # 👉️ ['name', 'id']

The difference() method returns a new set with elements in the set that are not in the provided iterable.

In other words, set(list2).difference(list1) returns a new set that contains the items in list2 that are not in list1.
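To make the set difference concrete, here is a small sketch that uses plain lists in place of the DataFrame's columns. The second half shows a list comprehension as one possible alternative when the original column order matters, since a set does not preserve order. The variable names are illustrative only.

```python
all_columns = ['id', 'name', 'experience', 'salary']
numeric_columns = ['experience', 'salary']

# items in all_columns that are not in numeric_columns (a set, so unordered)
remaining = set(all_columns).difference(numeric_columns)
print(remaining == {'id', 'name'})  # True

# a list comprehension gives the same items while keeping the original order
remaining_in_order = [
    col for col in all_columns if col not in numeric_columns
]
print(remaining_in_order)  # ['id', 'name']
```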

# Checking if a specific DataFrame column is Categorical

If you need to check if a specific DataFrame column is categorical:

  1. Get the dtype name of the column by accessing the dtype.name attribute.
  2. Check if the returned string is equal to "category".
main.py
import pandas as pd

df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b', 'c', 'd']),
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
    'salary': [189.1, 180.2, 190.3, 205.4],
})

if df['name'].dtype.name == 'category':
    # 👇️ this runs
    print('The column is categorical')
else:
    print('The column is NOT categorical')

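The dtype attribute returns a dtype object (a CategoricalDtype for categorical columns) rather than a plain string. pandas does let you compare a categorical dtype to the string "category" directly, but going through the name makes the string comparison explicit.

We accessed the name attribute on the object to get the data type name as a string.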

The last step is to compare the returned value with the string "category".
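An alternative that avoids string comparison altogether is to check the dtype object's class with isinstance(). The following is a minimal sketch; is_categorical is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    'name': pd.Categorical(['Alice', 'Bobby', 'Carl', 'Dan']),
    'experience': [1, 5, 3, 8],
})


def is_categorical(column):
    # hypothetical helper: True when the Series uses the categorical dtype
    return isinstance(column.dtype, pd.CategoricalDtype)


print(is_categorical(df['name']))        # True
print(is_categorical(df['experience']))  # False
```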

I've also written an article on how to change the type of a column to Categorical.


Copyright © 2024 Borislav Hadzhiev