Pandas: Find the percentage of Missing values in each Column

avatar
Borislav Hadzhiev

Last updated: Apr 12, 2024
5 min

banner

# Table of Contents

  1. Pandas: Find the percentage of Missing values in each Column
  2. Getting the results in a new DataFrame
  3. Sorting the new DataFrame by the percent_missing column
  4. Find the percentage of Missing values in each Column using mean()

# Pandas: Find the percentage of Missing values in each Column

To find the percentage of missing values in each column in a Pandas DataFrame:

  1. Use the DataFrame.isnull() method to detect the missing values in the DataFrame.
  2. Sum the missing values, multiply the sum by 100 and divide the result by the length of the DataFrame.
main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = df.isnull().sum() * 100 / len(df) print(percent_missing) print('-' * 50) df2 = pd.DataFrame({ 'col_name': df.columns, 'percent_missing': percent_missing }) print(df2)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
name 50.0 experience 75.0 salary 25.0 dtype: float64 -------------------------------------------------- col_name percent_missing name name 50.0 experience experience 75.0 salary salary 25.0

find percentage of missing values in each column

The DataFrame.isnull() method detects the missing values in the DataFrame.

The method returns True for each missing value and False otherwise.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) # name experience salary # 0 False True True # 1 False False False # 2 True True False # 3 True True False print(df.isnull())

detect missing values in dataframe

The code for this article is available on GitHub

To get the number of missing values in each column as a percentage, multiply the sum by 100 and divide the result by the DataFrame's length.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = df.isnull().sum() * 100 / len(df) # name 50.0 # experience 75.0 # salary 25.0 # dtype: float64 print(percent_missing)

get percentage of missing values

# Getting the results in a new DataFrame

If you need to get the percentage of missing values in each column in a DataFrame, use the pd.DataFrame() class.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = df.isnull().sum() * 100 / len(df) df2 = pd.DataFrame({ 'col_name': df.columns, 'percent_missing': percent_missing }) # col_name percent_missing # name name 50.0 # experience experience 75.0 # salary salary 25.0 print(df2)

get percentage of missing values in new dataframe

The code for this article is available on GitHub

# Sorting the new DataFrame by the percent_missing column

If you need to sort the new DataFrame by the percent_missing column, use the sort_values() method.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = df.isnull().sum() * 100 / len(df) df2 = pd.DataFrame({ 'col_name': df.columns, 'percent_missing': percent_missing }) print(df2) print('-' * 50) df2.sort_values('percent_missing', inplace=True) print(df2)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
col_name percent_missing name name 50.0 experience experience 75.0 salary salary 25.0 -------------------------------------------------- col_name percent_missing salary salary 25.0 name name 50.0 experience experience 75.0

sort new dataframe by percent missing column

The DataFrame.sort_values method sorts the values in the DataFrame along a given axis.

# Find the percentage of Missing values in each Column using mean()

You can also use the isnull() and mean() methods to find the percentage of missing values in each column in a Pandas DataFrame.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = df.isnull().mean() * 100 print(percent_missing) print('-' * 50) df2 = pd.DataFrame({ 'col_name': df.columns, 'percent_missing': percent_missing }) print(df2)
The code for this article is available on GitHub

Running the code sample produces the following output.

shell
name 50.0 experience 75.0 salary 25.0 dtype: float64 -------------------------------------------------- col_name percent_missing name name 50.0 experience experience 75.0 salary salary 25.0

find percentage of missing values using mean

We used the isnull() method to detect the missing values in each column and then called the DataFrame.mean method on the result.

main.py
percent_missing = df.isnull().mean() * 100 # name 50.0 # experience 75.0 # salary 25.0 # dtype: float64 print(percent_missing)

The DataFrame.mean method returns the mean of the values over the requested axis.

The last step is to multiply the result by 100 to get the percentage of missing values.

If you need to convert the result to a DataFrame, use the pd.DataFrame class.

main.py
df2 = pd.DataFrame({ 'col_name': df.columns, 'percent_missing': percent_missing }) # col_name percent_missing # name name 50.0 # experience experience 75.0 # salary salary 25.0 print(df2)

You can also use the mul() method for the multiplication.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = df.isnull().mean().round(4).mul(100) # name 50.0 # experience 75.0 # salary 25.0 # dtype: float64 print(percent_missing)
The code for this article is available on GitHub

If you need to sort the values, use the sort_values() method and set the ascending argument to True or False.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = df.isnull().mean().round( 4).mul(100).sort_values(ascending=True) # salary 25.0 # name 50.0 # experience 75.0 # dtype: float64 print(percent_missing)

The code sample sorts the values in ascending order, but you can also sort them in descending order by setting ascending to False.

You can also pass the result to the pd.DataFrame() class to convert it to a DataFrame.

main.py
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bobby', None, None], 'experience': [None, 5, None, None], 'salary': [None, 180.2, 190.3, 205.4], }) percent_missing = pd.DataFrame( df.isnull().mean().round(4).mul(100).sort_values(ascending=True), columns=['percent_missing'] ) # percent_missing # salary 25.0 # name 50.0 # experience 75.0 print(percent_missing)
The code for this article is available on GitHub

# Additional Resources

You can learn more about the related topics by checking out the following tutorials:

I wrote a book in which I share everything I know about how to become a better, more efficient programmer.
book cover
You can use the search field on my Home Page to filter through all of my articles.