I tried this but couldn't get it to work for my data: Use Scikit Learn to do linear regression on a time series pandas data frame
My data consists of 2 DataFrames: DataFrame_1.shape = (40,5000) and DataFrame_2.shape = (40,74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2,74).
Is there any linear regression algorithm in sklearn that can handle NaN values?
I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506,13) and (506,).
Here's my simplified code:
from sklearn.linear_model import LinearRegression

X = DataFrame_1
for col in DataFrame_2.columns:
    y = DataFrame_2[col]
    model = LinearRegression()
    model.fit(X, y)
    # ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I used the loop above so that the shapes of X and y match up.
If posting DataFrame_2 would help, please comment below and I'll add it.
Linear regression: The variable with missing data is used as the dependent variable. Cases with complete data for the predictor variables are used to generate the regression equation; the equation is then used to predict missing values for incomplete cases.
By far the most common approach to the missing data is to simply omit those cases with the missing data and analyze the remaining data. This approach is known as the complete case (or available case) analysis or listwise deletion.
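Applied to the question above, complete-case analysis does not have to drop rows across all 74 target columns at once. A minimal sketch, assuming each target column is fitted separately and only the rows where that particular column is observed are used. The frames `X` and `Y` are small synthetic stand-ins for DataFrame_1 and DataFrame_2, not the asker's data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for DataFrame_1 (predictors) and DataFrame_2 (targets).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(40, 5)), columns=[f"x{i}" for i in range(5)])
Y = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])
Y.iloc[::4, 0] = np.nan   # knock out some values in column "a"
Y.iloc[1::5, 1] = np.nan  # and some in column "b"

models = {}
for col in Y.columns:
    y = Y[col]
    mask = y.notna()  # rows where this target is observed
    # Fit on the complete cases for this column only.
    models[col] = LinearRegression().fit(X[mask], y[mask])
    print(col, "fitted on", int(mask.sum()), "rows")
```

Because the mask is computed per column, a row that is missing in one target still contributes to the models for the other targets.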
Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. This class also allows for different missing values encodings.
With regression imputation the information of other variables is used to predict the missing values in a variable by using a regression model. Commonly, first the regression model is estimated in the observed data and subsequently using the regression weights the missing values are predicted and replaced.
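The two steps described here (estimate on the observed rows, then predict the missing ones) can be sketched roughly as follows. The column names `x1`, `x2`, `z` and the toy data are made up for illustration, and the sketch assumes the predictor columns themselves have no missing values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: "z" depends linearly on "x1" and "x2"; a few "z" values are missing.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=30), "x2": rng.normal(size=30)})
df["z"] = 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=30)
df.loc[[3, 7, 11], "z"] = np.nan

observed = df["z"].notna()
predictors = ["x1", "x2"]

# Step 1: estimate the regression on rows where z is observed.
reg = LinearRegression().fit(df.loc[observed, predictors], df.loc[observed, "z"])

# Step 2: predict z for the incomplete rows and write the predictions back.
df.loc[~observed, "z"] = reg.predict(df.loc[~observed, predictors])
```

Note that imputed values sit exactly on the fitted regression surface, so they carry less variability than real observations would.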
If your variable is a DataFrame, you could use fillna. Here I replaced the missing data with the mean of that column.
df.fillna(df.mean(), inplace=True)
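To make the behaviour of that one-liner concrete, here is a small sketch on a made-up two-column frame. It uses the non-inplace form so the original frame stays untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# df.mean() skips NaN by default and returns one mean per column;
# fillna then fills each column's gaps with that column's own mean.
filled = df.fillna(df.mean())
print(filled)
```

Column `a` has mean 2.0 and column `b` has mean 6.0, so those are the values written into the gaps.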
You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:
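# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")
# fit_transform expects 2-D input, so convert the Series to a one-column frame first
y_imputed = imputer.fit_transform(y.to_frame()).ravel()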
Otherwise, you might want to build your model using only a subset of the 74 columns as predictors. Perhaps some of your columns contain fewer null values?
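One way to find such columns is to compute the fraction of missing values per column and keep only those below a cutoff. A minimal sketch on a hypothetical frame `df2`; the 25% threshold is an arbitrary choice for illustration, not a recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DataFrame_2.
df2 = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0, 4.0],
})

null_share = df2.isna().mean()                 # fraction of missing values per column
keep = null_share[null_share <= 0.25].index    # arbitrary cutoff for illustration
subset = df2[keep]
print(null_share.sort_values())
```

Here `b` is 75% missing and is dropped, while `a` and `c` are kept.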