 

Using scikit-learn (sklearn), how to handle missing data for linear regression?

I tried this, but couldn't get it to work for my data: "Use Scikit Learn to do linear regression on a time series pandas data frame"

My data consists of 2 DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN (missing) values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).

Is there any linear regression algorithm in sklearn that can handle NaN values?

I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).

Here's my simplified code:

from sklearn.linear_model import LinearRegression

X = DataFrame_1  # shape (40, 5000), fully observed
for col in DataFrame_2.columns:
    y = DataFrame_2[col]  # shape (40,), contains NaN values
    model = LinearRegression()
    model.fit(X, y)

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used the format above so that the shapes of the matrices match up.

If posting the DataFrame_2 would help, please comment below and I'll add it.

Asked Oct 13 '15 by O.rka


People also ask

Can you do linear regression with missing data?

The variable with missing data is used as the dependent variable. Cases with complete data for the predictor variables are used to generate the regression equation; the equation is then used to predict missing values for incomplete cases.

How do you handle missing data in regression analysis?

By far the most common approach to the missing data is to simply omit those cases with the missing data and analyze the remaining data. This approach is known as the complete case (or available case) analysis or listwise deletion.
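
In pandas, complete-case analysis is just dropna (what the asker already tried); a minimal sketch with made-up data:

import numpy as np
import pandas as pd

# Hypothetical toy frame with one incomplete row
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Listwise deletion: keep only rows where every column is observed
complete_cases = df.dropna(how="any")  # drops the third row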

Can Sklearn handle missing values?

Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located. This class also allows for different missing-value encodings.
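
A minimal sketch of that approach, assuming a recent scikit-learn where the imputer class is SimpleImputer in sklearn.impute (older releases exposed it as sklearn.preprocessing.Imputer):

import numpy as np
from sklearn.impute import SimpleImputer

# strategy can be "mean", "median", "most_frequent", or "constant" (with fill_value)
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_imputed = imputer.fit_transform(X)  # the NaN becomes the column mean, 4.0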

How do you impute missing values in linear regression?

With regression imputation, the information in other variables is used to predict the missing values of a variable via a regression model. Commonly, the regression model is first estimated on the observed data, and the regression weights are then used to predict and replace the missing values.
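
A rough sketch of regression imputation with scikit-learn (hypothetical column names, not the asker's data):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
                   "target": [1.0, np.nan, 3.0, 4.0, np.nan]})

observed = df["target"].notna()

# Fit the regression model on the complete cases only
reg = LinearRegression().fit(df.loc[observed, ["x1", "x2"]],
                             df.loc[observed, "target"])

# Predict and replace the missing target values
df.loc[~observed, "target"] = reg.predict(df.loc[~observed, ["x1", "x2"]])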


2 Answers

If your variable is a DataFrame, you could use fillna. Here I replaced the missing data with the mean of each column.

df.fillna(df.mean(), inplace=True)
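
Applied to the frames in the question (a sketch using the names from the post, assuming column means are an acceptable stand-in for the missing values):

from sklearn.linear_model import LinearRegression

# Replace every NaN in DataFrame_2 with its column's mean, then fit as before
DataFrame_2 = DataFrame_2.fillna(DataFrame_2.mean())

X = DataFrame_1
for col in DataFrame_2.columns:
    model = LinearRegression().fit(X, DataFrame_2[col])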
Answered Oct 05 '22 by Foreever


You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet:

from sklearn.preprocessing import Imputer

imputer = Imputer()  # defaults to mean imputation
# Imputer works on 2-D arrays, so reshape the single target column first
y_imputed = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()
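
Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; on current releases the equivalent (as a sketch) is SimpleImputer:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
y_imputed = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()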

Otherwise, you might want to build your model using a subset of the 74 columns as predictors; perhaps some of your columns contain fewer null values?
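
One way to find those columns (a sketch, reusing DataFrame_2 from the question and an arbitrary 10% missingness threshold):

# Keep only the columns of DataFrame_2 where fewer than 10% of values are missing
good_cols = DataFrame_2.columns[DataFrame_2.isnull().mean() < 0.1]
DataFrame_2_subset = DataFrame_2[good_cols]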

Answered Oct 05 '22 by maxymoo