 

Using scikit-learn (sklearn), how to handle missing data for linear regression?

I tried this, but couldn't get it to work for my data: "Use Scikit Learn to do linear regression on a time series pandas data frame"

My data consists of 2 DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).

Is there any linear regression algorithm in sklearn that can handle NaN values?

I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).

Here's my simplified code:

from sklearn.linear_model import LinearRegression

X = DataFrame_1  # shape (40, 5000), fully observed
for col in DataFrame_2.columns:
    y = DataFrame_2[col]  # shape (40,), contains NaN values
    model = LinearRegression()
    model.fit(X, y)

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used the format above so that the shapes of the matrices match up.

If posting the DataFrame_2 would help, please comment below and I'll add it.

Asked Oct 13 '15 by O.rka


People also ask

Can you do linear regression with missing data?

The variable with missing data is used as the dependent variable. Cases with complete data for the predictor variables are used to generate the regression equation; the equation is then used to predict missing values for incomplete cases.

How do you handle missing data in regression analysis?

By far the most common approach to the missing data is to simply omit those cases with the missing data and analyze the remaining data. This approach is known as the complete case (or available case) analysis or listwise deletion.
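
In pandas, complete-case analysis is just dropna (what the asker already tried); a minimal sketch with made-up data:

import numpy as np
import pandas as pd

# Hypothetical toy frame with one incomplete row
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Listwise deletion: keep only rows where every column is observed
complete_cases = df.dropna(how="any")  # drops the third row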

Can Sklearn handle missing values?

Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located. This class also allows for different missing-value encodings.
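
A minimal sketch of that approach, assuming a recent scikit-learn where the imputer class is SimpleImputer in sklearn.impute (older releases exposed it as sklearn.preprocessing.Imputer):

import numpy as np
from sklearn.impute import SimpleImputer

# strategy can be "mean", "median", "most_frequent", or "constant" (with fill_value)
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_imputed = imputer.fit_transform(X)  # the NaN becomes the column mean, 4.0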

How do you impute missing values in linear regression?

With regression imputation, the information in other variables is used to predict the missing values of a variable via a regression model. Commonly, the regression model is first estimated on the observed data, and the regression weights are then used to predict and replace the missing values.
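
A rough sketch of regression imputation with scikit-learn (hypothetical column names, not the asker's data):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
                   "target": [1.0, np.nan, 3.0, 4.0, np.nan]})

observed = df["target"].notna()

# Fit the regression model on the complete cases only
reg = LinearRegression().fit(df.loc[observed, ["x1", "x2"]],
                             df.loc[observed, "target"])

# Predict and replace the missing target values
df.loc[~observed, "target"] = reg.predict(df.loc[~observed, ["x1", "x2"]])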


2 Answers

If your variable is a DataFrame, you could use fillna. Here I replaced the missing data with the mean of each column.

df.fillna(df.mean(), inplace=True)
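
Applied to the frames in the question (a sketch using the names from the post, assuming column means are an acceptable stand-in for the missing values):

from sklearn.linear_model import LinearRegression

# Replace every NaN in DataFrame_2 with its column's mean, then fit as before
DataFrame_2 = DataFrame_2.fillna(DataFrame_2.mean())

X = DataFrame_1
for col in DataFrame_2.columns:
    model = LinearRegression().fit(X, DataFrame_2[col])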
Answered Oct 05 '22 by Foreever


You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:

from sklearn.preprocessing import Imputer

imputer = Imputer()  # defaults to mean imputation
# Imputer works on 2-D arrays, so reshape the single target column first
y_imputed = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()
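
Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; on current releases the equivalent (as a sketch) is SimpleImputer:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
y_imputed = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()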

Otherwise, you might want to build your model using a subset of the 74 columns as predictors; perhaps some of your columns contain fewer null values?
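
One way to find those columns (a sketch, reusing DataFrame_2 from the question and an arbitrary 10% missingness threshold):

# Keep only the columns of DataFrame_2 where fewer than 10% of values are missing
good_cols = DataFrame_2.columns[DataFrame_2.isnull().mean() < 0.1]
DataFrame_2_subset = DataFrame_2[good_cols]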

Answered Oct 05 '22 by maxymoo