
The PySpark "ValueError: Some of types cannot be determined after inferring"
occurs when you try to convert a Pandas DataFrame to a PySpark DataFrame but
one or more of the columns only contain None (missing) values.
To solve the error, manually define a schema and supply it when calling
spark.createDataFrame().
Here is an example of how the error occurs.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ⛔️ ValueError: Some of types cannot be determined after inferring
new_df = spark.createDataFrame(df)
print(new_df)

By default, PySpark tries to infer the type of each column based on the types of the non-None values in that column.
However, if a column only contains None (or missing) values, PySpark is not able to infer its type and raises the error.
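For example, as long as a column contains at least one non-None value, inference succeeds. Here is a minimal sketch (not from the original example) where salary has a single non-None value:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `salary` contains one non-None value, so PySpark can infer its type
df = pd.DataFrame({
    'name': ['Alice', 'Bobby'],
    'salary': [100.0, None],
})

new_df = spark.createDataFrame(df)
print(new_df)  # DataFrame[name: string, salary: double]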
To solve the error, manually define the schema and specify the type of the values in each column.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

schema = StructType(
    [
        StructField("name", StringType(), nullable=True),
        StructField("experience", IntegerType(), nullable=True),
        StructField("salary", DoubleType(), nullable=True),
    ]
)

new_df = spark.createDataFrame(df, schema=schema)

# DataFrame[name: string, experience: int, salary: double]
print(new_df)

We used the StructType class to define a list of StructFields.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

schema = StructType(
    [
        StructField("name", StringType(), nullable=True),
        StructField("experience", IntegerType(), nullable=True),
        StructField("salary", DoubleType(), nullable=True),
    ]
)
We passed 3 arguments to each StructField:

1. The name of the field.
2. The dataType of the field.
3. nullable - whether the field can be null (None) or not.

Make sure to import each type as shown in the code sample.
Depending on what types your columns contain, you will likely have to adjust the field names and types.
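If you'd rather not build StructType objects by hand, spark.createDataFrame() also accepts a DDL-formatted schema string. A minimal sketch of the equivalent schema (assuming the same df as above):

# Equivalent schema expressed as a DDL-formatted string
new_df = spark.createDataFrame(
    df,
    schema='name string, experience int, salary double',
)
print(new_df)  # DataFrame[name: string, experience: int, salary: double]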
The last step is to ensure that you pass the schema when calling
spark.createDataFrame().
new_df = spark.createDataFrame(df, schema=schema)
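You can verify the resulting types by printing the schema of the new DataFrame:

# Inspect the schema of the converted DataFrame
new_df.printSchema()

# root
#  |-- name: string (nullable = true)
#  |-- experience: integer (nullable = true)
#  |-- salary: double (nullable = true)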
The SparkSession.createDataFrame method creates a PySpark DataFrame from a Pandas DataFrame.
When the schema argument is None, the method tries to infer the schema
(column names and types) from the supplied data.
When the schema argument is a list of column names, the type of each column is
inferred from the supplied data.
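For example, when the data is a list of tuples, you can pass just the column names and let PySpark infer the types. A minimal sketch (not from the original example):

# Schema given as a list of column names; types are inferred from the data
rows = [('Alice', 11), ('Bobby', 14)]
new_df = spark.createDataFrame(rows, schema=['name', 'experience'])
print(new_df)  # DataFrame[name: string, experience: bigint]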
Drop the columns that contain all None values

You can also solve the error by dropping the columns that contain all None values, using the DataFrame.dropna() method.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ✅ Drop columns with all None values
df = df.dropna(axis='columns', how='all')

new_df = spark.createDataFrame(df)
print(new_df)

The DataFrame.dropna() method removes missing values from the DataFrame.
When the axis argument is set to 'columns' (or 1), the method drops columns that contain missing values.
We also set the how argument to "all", which means "if all values are NA (missing), drop that column".
Therefore, the dropna() method call will drop all columns that only contain
missing values, so the spark.createDataFrame() method will be able to infer
the type of each field.
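If you want to see which columns dropna() would remove before calling it, one quick sketch is to select the columns where every value is missing:

# Columns where every value is NA - exactly what dropna(how='all') removes
all_none_cols = df.columns[df.isna().all()]
print(list(all_none_cols))  # ['salary']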
We could've also dropped all None-only columns in place.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ✅ Drop columns with all None values (in place)
df.dropna(axis='columns', how='all', inplace=True)

new_df = spark.createDataFrame(df)
print(new_df)

When the inplace argument is set to True, the columns with all None values
get dropped in place (the original DataFrame is mutated).
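A quick way to confirm that the original DataFrame was mutated is to inspect its columns afterwards:

# The all-None `salary` column is gone from the original DataFrame
print(df.columns.tolist())  # ['name', 'experience']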