I have a huge data set and want to predict (not replace) the missing values with a machine learning algorithm such as an SVM or a random forest in Python.
My data set looks like this:
ID i0   i1    i2    i3    i4   i5     j0    j1   j2   j3    j4    j5    
0  0.19 -0.02 -0.20 0.07 -0.06 -0.06  -0.06 1.48 0.33 -0.46 -0.37 -0.11
1 -0.61 -0.19 -0.10 -0.1 -0.21  0.63   NA    NA   NA   NA    NA    NA
2 -0.31 -0.14 -0.64 -0.5 -0.20 -0.30  -0.08 1.56 -0.2 -0.33  0.81 -0.03
.
.
What I want to do:
On the basis of IDs 0 and 2, I want to train a model that predicts j0 to j5 from i0 to i5. Afterwards, the model should predict the NA's in j0 to j5 for ID 1.
Question:
As the data is not continuous (the time steps end at i5 and start again at j0), is it possible to use some kind of regression?
How should the X and the y for the .fit(X, y) and .predict(X) function look like in this example?
Predicting the missing values with regression:
We can use the features with non-null values to predict the missing values: a regression (or classification) model is trained on the complete rows and then used to fill in the incomplete ones.
An alternative way of handling missing values is to delete the rows or columns that contain them. If a column has more than half of its values null, you can drop the entire column; in the same way, rows can be dropped if one or more of their values are null.
In your case, you're looking at a multi-output regression problem:
You can read more about this in the sklearn documentation on multiclass and multioutput algorithms.
Here I'm going to show you how you can use sklearn.multioutput.MultiOutputRegressor with a sklearn.ensemble.RandomForestRegressor to predict your values.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Generate a toy data set with 6 features and 6 targets
X, y = make_regression(n_samples=1000, n_features=6,
                       n_informative=3, n_targets=6,
                       tail_strength=0.5, noise=0.02,
                       shuffle=True, coef=False, random_state=0)
# Convert to a pandas dataframe like in your example
icols = ['i0','i1','i2','i3','i4','i5']
jcols = ['j0', 'j1', 'j2', 'j3', 'j4', 'j5']
df = pd.concat([pd.DataFrame(X, columns=icols),
                pd.DataFrame(y, columns=jcols)], axis=1)
# Introduce a few np.nans in there
df.loc[0, jcols] = np.nan
df.loc[10, jcols] = np.nan
df.loc[100, jcols] = np.nan
df.head()
Out:
     i0    i1    i2    i3    i4    i5     j0     j1     j2     j3     j4  \
0 -0.21 -0.18 -0.06  0.27 -0.32  0.00    NaN    NaN    NaN    NaN    NaN   
1  0.65 -2.16  0.46  1.82  0.22 -0.13  33.08  39.85   9.63  13.52  16.72   
2 -0.75 -0.52 -1.08  0.14  1.12 -1.05  -0.96 -96.02  14.37  25.19 -44.90   
3  0.01  0.62  0.20  0.53  0.35 -0.73   6.09 -12.07 -28.88  10.49   0.96   
4  0.39 -0.70 -0.55  0.10  1.65 -0.69  83.15  -3.16  93.61  57.44 -17.33   
      j5  
0    NaN  
1  17.79  
2 -77.48  
3 -35.61  
4  -2.47  
The data is split into a train and a test set so that we can validate the model. Only the rows where all j columns are present can be used for training:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Keep only the rows where every j column is non-null
notnans = df[jcols].notnull().all(axis=1)
df_notnans = df[notnans]

# Split into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(df_notnans[icols], df_notnans[jcols],
                                                    train_size=0.75,
                                                    random_state=4)
regr_multirf = MultiOutputRegressor(RandomForestRegressor(max_depth=30,
                                                          random_state=0))
# Fit on the train data
regr_multirf.fit(X_train, y_train)
# Check the prediction score
score = regr_multirf.score(X_test, y_test)
print("The prediction score on the test data is {:.2f}%".format(score*100))
Out: The prediction score on the test data is 96.76%
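A caveat on that number: `.score` returns the coefficient of determination R² averaged uniformly over the six targets, not an accuracy percentage. If you want to see how well each j column is predicted individually, you can ask `r2_score` for one value per target. The sketch below rebuilds the same toy setup so it runs on its own; with your fitted model you would just call `r2_score(y_test, regr_multirf.predict(X_test), multioutput='raw_values')`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=1000, n_features=6, n_informative=3,
                       n_targets=6, noise=0.02, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75,
                                                    random_state=4)
regr = MultiOutputRegressor(RandomForestRegressor(max_depth=30, random_state=0))
regr.fit(X_train, y_train)

# R^2 for each of the six targets separately, instead of the uniform average
per_target_r2 = r2_score(y_test, regr.predict(X_test), multioutput='raw_values')
print(per_target_r2)  # one score per j column
```

This tells you whether one of the j columns is dragging the average down.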
# Predict the j columns for the rows that had NaNs
df_nans = df.loc[~notnans].copy()
df_nans[jcols] = regr_multirf.predict(df_nans[icols])
df_nans
Out:
           i0        i1        i2        i3        i4        i5         j0  \
0   -0.211620 -0.177927 -0.062205  0.267484 -0.317349  0.000341 -41.254983   
10   1.138974 -1.326378  0.123960  0.982841  0.273958  0.414307  46.406351   
100 -0.682390 -1.431414 -0.328235 -0.886463  1.212363 -0.577676  94.971966   
            j1         j2         j3         j4         j5  
0   -18.197513 -31.029952 -14.749244  -5.990595  -9.296744  
10   67.915628  59.750032  15.612843  10.177314  38.226387  
100  -3.724223  65.630692  44.636895 -14.372414  11.947185  
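As a side note, RandomForestRegressor supports multiple targets natively, so the MultiOutputRegressor wrapper is optional here; the wrapper mainly matters for estimators that are strictly single-output, such as SVR. A minimal sketch of the direct approach on made-up data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=6, n_targets=6, random_state=0)

# A single forest fits all six targets at once: each leaf stores a 6-vector
rf = RandomForestRegressor(max_depth=30, random_state=0)
rf.fit(X, y)
pred = rf.predict(X[:3])
print(pred.shape)  # one row per sample, one column per target
```

The wrapper fits one independent forest per target, while the native version lets each tree model all targets jointly; either way the `.fit(X, y)` / `.predict(X)` shapes stay the same as in the answer above.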