Last updated: Apr 12, 2024
The PySpark "ValueError: Some of types cannot be determined after inferring" error occurs when you try to convert a Pandas DataFrame to a PySpark DataFrame, but one or more of the columns only contain None (missing) values.

To solve the error, manually define a schema and supply it when calling spark.createDataFrame().
Here is an example of how the error occurs.
```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ⛔️ ValueError: Some of types cannot be determined after inferring
new_df = spark.createDataFrame(df)
print(new_df)
```
By default, PySpark tries to infer the type of each column based on the type of the non-None values it contains.

However, if a column only contains None (missing) values, PySpark is not able to infer its type and raises the error.
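For example, type inference succeeds as soon as a column contains at least one non-None value. Here is a minimal sketch (the values are made up for illustration):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `salary` contains one non-None value, so PySpark can infer its type
df = pd.DataFrame({
    'name': ['Alice', 'Bobby'],
    'salary': [1500.0, None],
})

new_df = spark.createDataFrame(df)
print(new_df)  # DataFrame[name: string, salary: double]
```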
To solve the error, manually define the schema and specify the type of the values in each column.
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

schema = StructType(
    [
        StructField("name", StringType(), nullable=True),
        StructField("experience", IntegerType(), nullable=True),
        StructField("salary", DoubleType(), nullable=True),
    ]
)

new_df = spark.createDataFrame(df, schema=schema)

# DataFrame[name: string, experience: int, salary: double]
print(new_df)
```
We used the StructType class to define a list of StructFields.
```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

schema = StructType(
    [
        StructField("name", StringType(), nullable=True),
        StructField("experience", IntegerType(), nullable=True),
        StructField("salary", DoubleType(), nullable=True),
    ]
)
```
We passed 3 arguments to each StructField:

1. The name of the field.
2. The dataType of the field.
3. A boolean that determines whether the field can be null (None) or not.

Make sure to import each type as shown in the code sample.
Depending on what types your columns contain, you will likely have to adjust the field names and types.
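If you want to double-check the result, you can print the schema of the new DataFrame. A quick sketch, assuming the new_df DataFrame from the example above:

```python
# Assumes `new_df` was created with the manual schema from the example above
new_df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- experience: integer (nullable = true)
#  |-- salary: double (nullable = true)
```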
The last step is to ensure that you pass the schema when calling spark.createDataFrame().

```python
new_df = spark.createDataFrame(df, schema=schema)
```
The SparkSession.createDataFrame method creates a PySpark DataFrame from a Pandas DataFrame.

When the schema argument is None, the method tries to infer the schema (column names and types) from the supplied data.

When the schema argument is a list of column names, the column names come from the list and the type of each column is inferred from the supplied data.
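For instance, here is a minimal sketch (with made-up data) of passing a list of column names and letting PySpark infer only the types:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The list only supplies the column names; the types are still inferred
rows = [('Alice', 11), ('Bobby', 14)]

new_df = spark.createDataFrame(rows, schema=['name', 'experience'])
print(new_df)  # DataFrame[name: string, experience: bigint]
```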
You can also solve the error by dropping the columns that contain all None values by using the DataFrame.dropna() method.
```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ✅ Drop columns with all None values
df = df.dropna(axis='columns', how='all')

new_df = spark.createDataFrame(df)
print(new_df)
```
The DataFrame.dropna() method removes missing values from the DataFrame.
When the axis argument is set to 'columns' (or 1), the method drops columns instead of rows.

We also set the how argument to "all", which means "if all values are NA (missing), drop that column".
Therefore, the dropna() method call drops all the columns that only contain missing values, so the spark.createDataFrame() method is able to infer the type of each field.
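To see the difference the how argument makes, compare it to how='any', which drops columns that contain even a single missing value. A short sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bobby'],
    'experience': [11, None],   # partially missing
    'salary': [None, None],     # entirely missing
})

# how='all' only drops `salary` (all of its values are missing)
print(df.dropna(axis='columns', how='all').columns.tolist())
# ['name', 'experience']

# how='any' also drops `experience` (it has at least one missing value)
print(df.dropna(axis='columns', how='any').columns.tolist())
# ['name']
```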
We could've also dropped all None-only columns in place.
```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    'name': ['Alice', 'Bobby', 'Carl', 'Dan'],
    'experience': [11, 14, 16, 18],
    'salary': [None, None, None, None],
})

# ✅ Drop columns with all None values
df.dropna(axis='columns', how='all', inplace=True)

new_df = spark.createDataFrame(df)
print(new_df)
```
When the inplace argument is set to True, the columns with all None values get dropped in place (the original DataFrame is mutated).
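Note that when inplace=True is used, dropna() returns None, so don't assign its result back to a variable. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice'], 'salary': [None]})

# The DataFrame is mutated; the method itself returns None
result = df.dropna(axis='columns', how='all', inplace=True)

print(result)               # None
print(df.columns.tolist())  # ['name']
```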