Borislav Hadzhiev

Last updated: Apr 12, 2024

# ValueError: Some of types cannot be determined after inferring

The PySpark error "ValueError: Some of types cannot be determined after inferring" occurs when you try to convert a Pandas DataFrame to a PySpark DataFrame and one or more of its columns contain only None (missing) values.

To solve the error, manually define a schema and supply it when calling spark.createDataFrame().

Here is an example of how the error occurs.

main.py
```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ⛔️ ValueError: Some of types cannot be determined after inferring
new_df = spark.createDataFrame(df)
print(new_df)
```


By default, PySpark tries to infer the type of each column based on the type of its non-None values.

However, if a column only contains None (or missing) values, PySpark is not able to infer its type and raises the error.
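As a contrast, here is a minimal sketch (with a made-up salary value) showing that inference succeeds as soon as a column contains at least one non-None value:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single non-None value is enough for PySpark to infer the column's type
df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'salary': [None, 1500.0, None, None],
})

new_df = spark.createDataFrame(df)
print(new_df)  # DataFrame[name: string, salary: double]
```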

# Manually define the schema to solve the error

To solve the error, manually define the schema and specify the type of the values in each column.

main.py
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType
)

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("experience", IntegerType(), nullable=True),
    StructField("salary", DoubleType(), nullable=True),
])

new_df = spark.createDataFrame(df, schema=schema)

# DataFrame[name: string, experience: int, salary: double]
print(new_df)
```


We used the StructType class to define a list of StructFields.

main.py
```python
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType
)

schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("experience", IntegerType(), nullable=True),
    StructField("salary", DoubleType(), nullable=True),
])
```

We passed 3 arguments to each StructField:

  1. The name of the field.
  2. The dataType of the field.
  3. Whether the field can be null (None) or not.
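
If you want to verify the schema that was applied, you can call printSchema() on the resulting DataFrame (a quick sketch, reusing new_df from the example above):

```python
new_df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- experience: integer (nullable = true)
#  |-- salary: double (nullable = true)
```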

Make sure to import each type as shown in the code sample.

Depending on what types your columns contain, you will likely have to adjust the field names and types.

The last step is to ensure that you pass the schema when calling spark.createDataFrame().

main.py
```python
new_df = spark.createDataFrame(df, schema=schema)
```

The SparkSession.createDataFrame method creates a PySpark DataFrame from a Pandas DataFrame.

When the schema argument is None, the method tries to infer the schema (column names and types) from the supplied data.

When the schema argument is a list of column names, the type of each column is inferred from the supplied data.
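
For instance, here is a minimal sketch of the list-of-column-names form (the rows are made up for illustration). The types are still inferred from the data, so this form doesn't help when a column is all None:

```python
rows = [('Alice', 11), ('Bobby', 14)]

# Only column names are supplied; string and bigint are inferred from the data
new_df = spark.createDataFrame(rows, ['name', 'experience'])
print(new_df)  # DataFrame[name: string, experience: bigint]
```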

# Solving the error by dropping the columns with all None values

You can also solve the error by dropping the columns in which all values are None with the DataFrame.dropna() method.

main.py
```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ✅ Drop columns with all None values
df = df.dropna(axis='columns', how='all')

new_df = spark.createDataFrame(df)
print(new_df)
```


The DataFrame.dropna() method removes missing values from the DataFrame.

When the axis argument is set to 'columns' (or 1), the method drops columns instead of rows.

We also set the how argument to "all", which means "drop the column if all of its values are NA (missing)".

Therefore, the dropna() method call will drop all columns that only contain missing values, so the spark.createDataFrame() method will be able to infer the type of each field.
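
To illustrate how the how argument changes the result, here is a small pandas-only sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2],        # no missing values
    'b': [None, 2],     # partially missing
    'c': [None, None],  # entirely missing
})

# how='all' drops only 'c', the column in which every value is missing
print(df.dropna(axis='columns', how='all').columns.tolist())  # ['a', 'b']

# how='any' also drops 'b', because it has at least one missing value
print(df.dropna(axis='columns', how='any').columns.tolist())  # ['a']
```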

We could've also dropped all None-only columns in place.

main.py
```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ✅ Drop columns with all None values (mutates the original DataFrame)
df.dropna(axis='columns', how='all', inplace=True)

new_df = spark.createDataFrame(df)
print(new_df)
```


When the inplace argument is set to True, the columns with all None values get dropped in place (the original DataFrame is mutated).
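
As a quick check (continuing the example above), printing the columns after the in-place call confirms that the original DataFrame was mutated:

```python
# The all-None 'salary' column is gone from the original DataFrame
print(df.columns.tolist())  # ['name', 'experience']
```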
