I have a huge data set and want to predict (not replace) the missing values with a machine learning algorithm such as an SVM or a random forest in Python.
My data set looks like this:
ID i0   i1    i2    i3    i4   i5     j0    j1   j2   j3    j4    j5    
0  0.19 -0.02 -0.20 0.07 -0.06 -0.06  -0.06 1.48 0.33 -0.46 -0.37 -0.11
1 -0.61 -0.19 -0.10 -0.1 -0.21  0.63   NA    NA   NA   NA    NA    NA
2 -0.31 -0.14 -0.64 -0.5 -0.20 -0.30  -0.08 1.56 -0.2 -0.33  0.81 -0.03
.
.
What I want to do:
On the basis of IDs 0 and 2, I want to train a model that predicts j0 to j5 from i0 to i5. Afterwards, the model should predict the NA's in j0 to j5 for ID 1.
Question:
As the data is not continuous (the time steps end at i5 and start again at j0), is it possible to use some kind of regression?
How should the X and the y for the .fit(X, y) and .predict(X) function look like in this example?
Predicting the missing values with regression:
We can use the features with non-null values to predict the missing values: a regression (or classification) model is trained on the complete rows and then used to fill in the incomplete ones.
An alternative way of handling missing values is to delete the rows or columns that contain them. If a column has more than half of its values null, you can drop the entire column; in the same way, rows can be dropped if one or more of their values are null.
In your case, you're looking at a multi-output regression problem:
You can read more about this in the sklearn documentation on multiclass and multioutput algorithms.
Here I'm going to show you how you can use sklearn.multioutput.MultiOutputRegressor with a sklearn.ensemble.RandomForestRegressor to predict your values.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Generate a toy data set with 6 features and 6 targets
X, y = make_regression(n_samples=1000, n_features=6,
                       n_informative=3, n_targets=6,
                       tail_strength=0.5, noise=0.02,
                       shuffle=True, coef=False, random_state=0)
# Convert to a pandas dataframe like in your example
icols = ['i0','i1','i2','i3','i4','i5']
jcols = ['j0', 'j1', 'j2', 'j3', 'j4', 'j5']
df = pd.concat([pd.DataFrame(X, columns=icols),
                pd.DataFrame(y, columns=jcols)], axis=1)
# Introduce a few np.nans in there
df.loc[0, jcols] = np.nan
df.loc[10, jcols] = np.nan
df.loc[100, jcols] = np.nan
df.head()
Out:
     i0    i1    i2    i3    i4    i5     j0     j1     j2     j3     j4  \
0 -0.21 -0.18 -0.06  0.27 -0.32  0.00    NaN    NaN    NaN    NaN    NaN   
1  0.65 -2.16  0.46  1.82  0.22 -0.13  33.08  39.85   9.63  13.52  16.72   
2 -0.75 -0.52 -1.08  0.14  1.12 -1.05  -0.96 -96.02  14.37  25.19 -44.90   
3  0.01  0.62  0.20  0.53  0.35 -0.73   6.09 -12.07 -28.88  10.49   0.96   
4  0.39 -0.70 -0.55  0.10  1.65 -0.69  83.15  -3.16  93.61  57.44 -17.33   
      j5  
0    NaN  
1  17.79  
2 -77.48  
3 -35.61  
4  -2.47  
The data is split into a train and a test set so that we can validate the model. Only the rows where all j columns are present can be used for training:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Keep only the rows where every j column is non-null
notnans = df[jcols].notnull().all(axis=1)
df_notnans = df[notnans]

# Split into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(df_notnans[icols], df_notnans[jcols],
                                                    train_size=0.75,
                                                    random_state=4)
regr_multirf = MultiOutputRegressor(RandomForestRegressor(max_depth=30,
                                                          random_state=0))
# Fit on the train data
regr_multirf.fit(X_train, y_train)
# Check the prediction score
score = regr_multirf.score(X_test, y_test)
print("The prediction score on the test data is {:.2f}%".format(score*100))
Out: The prediction score on the test data is 96.76%
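A caveat on that number: `.score` returns the coefficient of determination R² averaged uniformly over the six targets, not an accuracy percentage. If you want to see how well each j column is predicted individually, you can ask `r2_score` for one value per target. The sketch below rebuilds the same toy setup so it runs on its own; with your fitted model you would just call `r2_score(y_test, regr_multirf.predict(X_test), multioutput='raw_values')`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=1000, n_features=6, n_informative=3,
                       n_targets=6, noise=0.02, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75,
                                                    random_state=4)
regr = MultiOutputRegressor(RandomForestRegressor(max_depth=30, random_state=0))
regr.fit(X_train, y_train)

# R^2 for each of the six targets separately, instead of the uniform average
per_target_r2 = r2_score(y_test, regr.predict(X_test), multioutput='raw_values')
print(per_target_r2)  # one score per j column
```

This tells you whether one of the j columns is dragging the average down.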
# Predict the j columns for the rows that had NaNs
df_nans = df.loc[~notnans].copy()
df_nans[jcols] = regr_multirf.predict(df_nans[icols])
df_nans
Out:
           i0        i1        i2        i3        i4        i5         j0  \
0   -0.211620 -0.177927 -0.062205  0.267484 -0.317349  0.000341 -41.254983   
10   1.138974 -1.326378  0.123960  0.982841  0.273958  0.414307  46.406351   
100 -0.682390 -1.431414 -0.328235 -0.886463  1.212363 -0.577676  94.971966   
            j1         j2         j3         j4         j5  
0   -18.197513 -31.029952 -14.749244  -5.990595  -9.296744  
10   67.915628  59.750032  15.612843  10.177314  38.226387  
100  -3.724223  65.630692  44.636895 -14.372414  11.947185  
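As a side note, RandomForestRegressor supports multiple targets natively, so the MultiOutputRegressor wrapper is optional here; the wrapper mainly matters for estimators that are strictly single-output, such as SVR. A minimal sketch of the direct approach on made-up data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=6, n_targets=6, random_state=0)

# A single forest fits all six targets at once: each leaf stores a 6-vector
rf = RandomForestRegressor(max_depth=30, random_state=0)
rf.fit(X, y)
pred = rf.predict(X[:3])
print(pred.shape)  # one row per sample, one column per target
```

The wrapper fits one independent forest per target, while the native version lets each tree model all targets jointly; either way the `.fit(X, y)` / `.predict(X)` shapes stay the same as in the answer above.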