Last updated: Apr 11, 2024
Reading time · 4 min
The article addresses the following 2 related errors:
The Python "ValueError: Input X contains infinity or a value too large for dtype('float64')" occurs when your matrix contains infinite or NaN values.
To solve the error, remove the infinite and NaN values from your matrix before the computation.
Here is an example of how the error occurs.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Contains inf and nan
    data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)

    X = df[['A', 'B']]
    y = df['C']

    model = LinearRegression()

    # ValueError: Input X contains infinity or a value too large for dtype('float64').
    model.fit(X, y)
Notice that the DataFrame contains inf (infinite) and NaN (not a number) values.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Contains inf and nan
    data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)

    #      A    B  C
    # 0  1.0  4.0  7
    # 1  inf  NaN  8
    # 2  3.0  6.0  9
    print(df)
We have to remove the infinite and NaN values from the DataFrame before calling the fit() method.
You can check if your DataFrame contains NaN or inf values by using 2 methods:
- numpy.isnan: tests for NaN element-wise.
- numpy.isfinite: tests for finiteness (not infinity and not NaN) element-wise.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Contains inf and nan
    data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)

    print(np.any(np.isnan(df)))  # True
    print(np.all(np.isfinite(df)))  # False
The np.any(np.isnan(df)) call returns True if the DataFrame contains at least one NaN value.
The np.all(np.isfinite(df)) call returns False if the DataFrame contains at least one inf or NaN value, because neither counts as finite.
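The two checks above only tell you that a problem exists, not where it is. As a minimal sketch (the variable names nan_counts, inf_counts and bad_rows are illustrative, not from the original article), you could count the offending values per column and select the affected rows:

```python
import numpy as np
import pandas as pd

data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Count the NaN values in each column
nan_counts = df.isna().sum()

# Count the +inf / -inf values in each column
inf_counts = np.isinf(df).sum()

# Select the rows that contain at least one non-finite value (inf or NaN)
bad_rows = df[~np.isfinite(df).all(axis=1)]

print(nan_counts)
print(inf_counts)
print(bad_rows)
```

With the sample data, column A reports one inf value, column B reports one NaN value, and only the row with index 1 is selected.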
Remove the NaN and inf values from the DataFrame to solve the error

You can solve the error by removing the NaN and inf values from the DataFrame.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)

    # Replace the `inf` values with `NaN`
    # (inplace=True is required, otherwise the result is discarded)
    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Drop the rows that have `NaN` values
    df.dropna(inplace=True)

    print(np.any(np.isnan(df)))  # False
    print(np.all(np.isfinite(df)))  # True

    #      A    B  C
    # 0  1.0  4.0  7
    # 2  3.0  6.0  9
    print(df)
I used the following 2 lines to remove the inf and NaN values from the DataFrame.
    # Replace the `inf` values with `NaN`
    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Drop the rows that have `NaN` values
    df.dropna(inplace=True)
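The call to the replace() method replaces inf values with NaN. By default, replace() returns a new DataFrame, so pass inplace=True (or reassign the result) for the change to take effect.
The call to the dropna() method drops the rows that have NaN values.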
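The following sketch illustrates why the result of replace() has to be kept. It uses a small made-up DataFrame in which the inf value does not share a row with a NaN value, so dropna() alone would not remove it. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.inf, 3], 'B': [4, 5, 6]})

# Without inplace=True (or reassignment) the result is discarded
df.replace([np.inf, -np.inf], np.nan)
still_has_inf = bool(np.isinf(df['A']).any())

# Reassigning the result keeps the change
cleaned = df.replace([np.inf, -np.inf], np.nan)
has_inf_after = bool(np.isinf(cleaned['A']).any())

print(still_has_inf)  # True
print(has_inf_after)  # False
```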
As the following code sample shows, the DataFrame no longer contains any inf or NaN values.
    print(np.any(np.isnan(df)))  # False
    print(np.all(np.isfinite(df)))  # True

    #      A    B  C
    # 0  1.0  4.0  7
    # 2  3.0  6.0  9
    print(df)
Here is a complete example that demonstrates that the error has been resolved.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)

    # Replace the `inf` values with `NaN`
    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Drop the rows that have `NaN` values
    df.dropna(inplace=True)

    X = df[['A', 'B']]
    y = df['C']

    model = LinearRegression()

    reg = model.fit(X, y)

    print(reg.score(X, y))  # 1.0
You can also define a reusable function that removes the inf and NaN values from a DataFrame.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)

    def remove_inf_nan(df):
        # Replace the `inf` values with `NaN`, then drop the rows
        # that have `NaN` values. Returns a new DataFrame.
        return df.replace([np.inf, -np.inf], np.nan).dropna()

    df = remove_inf_nan(df)

    print(np.any(np.isnan(df)))  # False
    print(np.all(np.isfinite(df)))  # True

    #      A    B  C
    # 0  1.0  4.0  7
    # 2  3.0  6.0  9
    print(df)
The remove_inf_nan function takes a DataFrame as a parameter, removes the inf and NaN values and returns the result.
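If your DataFrame has columns that are not passed to the model, dropping every row with a NaN anywhere may discard more data than necessary. As a rough sketch, a variant of the function could restrict the cleanup to the columns you intend to use. The helper name remove_inf_nan_in and the extra column D are hypothetical and are not part of the original example.

```python
import numpy as np
import pandas as pd


def remove_inf_nan_in(df, columns):
    """Hypothetical helper: clean only `columns`, return a new DataFrame."""
    cleaned = df.copy()
    # Replace the `inf` values with `NaN` in the selected columns only
    cleaned[columns] = cleaned[columns].replace([np.inf, -np.inf], np.nan)
    # Drop the rows that have `NaN` in any of the selected columns
    return cleaned.dropna(subset=columns)


data = {'A': [1, np.inf, 3], 'B': [4, 5, 6], 'D': [np.nan, 1, 2]}
df = pd.DataFrame(data)

result = remove_inf_nan_in(df, ['A', 'B'])
print(result)
```

Here the row with index 0 is kept even though column D contains NaN, because only columns A and B are checked.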
Replace the inf and NaN values

If you instead want to replace the inf and NaN values, use the DataFrame.fillna() method.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
    df = pd.DataFrame(data)

    # Replace the `inf` values with `NaN`
    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # Replace `NaN` values with 555
    df.fillna(555, inplace=True)

    print(np.any(np.isnan(df)))  # False
    print(np.all(np.isfinite(df)))  # True

    #        A      B  C
    # 0    1.0    4.0  7
    # 1  555.0  555.0  8
    # 2    3.0    6.0  9
    print(df)
We used the DataFrame.replace() method to replace the inf values with NaN.
We then used the DataFrame.fillna() method to replace the NaN values with the number 555.
The fillna() method fills NA/NaN values.
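A fixed placeholder such as 555 is easy to demonstrate but can distort a model. As one possible alternative (a sketch, not something the original example does), you could fill each column with its own median by passing the result of DataFrame.median() to fillna():

```python
import numpy as np
import pandas as pd

data = {'A': [1, np.inf, 3], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Replace the `inf` values with `NaN`
df = df.replace([np.inf, -np.inf], np.nan)

# Fill the `NaN` values in each column with that column's median
# (median() skips NaN values by default)
df = df.fillna(df.median())

#      A    B  C
# 0  1.0  4.0  7
# 1  2.0  5.0  8
# 2  3.0  6.0  9
print(df)
```

Whether the median, the mean or some other value is appropriate depends on your data, so treat this as a starting point.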
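If the error persists, try to reset the index of your DataFrame before the computation.

    # drop=True discards the old index instead of adding it as a new column
    df = df.reset_index(drop=True)

The DataFrame.reset_index method resets the index of the given DataFrame or a level of it.
This can resolve the error if it was caused by removing entries from your DataFrame, for example when a later operation aligns on the index and the gaps left by the dropped rows introduce new NaN values.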
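To illustrate what resetting the index does, here is a small sketch with made-up data. After dropping a row, the index has a gap. Calling reset_index(drop=True) renumbers the rows without adding the old index as a column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6]})

# Dropping the row with NaN leaves a gap in the index
df = df.dropna()
index_before = list(df.index)
print(index_before)  # [0, 2]

# Renumber the rows and discard the old index
df = df.reset_index(drop=True)
index_after = list(df.index)
print(index_after)  # [0, 1]

print(list(df.columns))  # ['A', 'B']
```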