Python Sklearn - RandomForest and Missing values

I'm trying to run RandomForest on a dataset that contains missing values.

My dataset looks like this:

train_data = [['1' 'NaN' 'NaN' '0.0127034' '0.0435092']
 ['1' 'NaN' 'NaN' '0.0113187' '0.228205']
 ['1' '0.648' '0.248' '0.0142176' '0.202707']
 ..., 
 ['1' '0.357' '0.470' '0.0328121' '0.255039']
 ['1' 'NaN' 'NaN' '0.00311825' '0.0381745']
 ['1' 'NaN' 'NaN' '0.0332604' '0.2857']]

To impute the "NaN" values, I'm using:

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values='NaN',strategy='mean',axis=0)
imp.fit(train_data[0::,1::])
new_train_data=imp.transform(train_data)

But I'm getting the following error:

Traceback (most recent call last):
  File "./RandomForest.py", line 72, in <module>
    new_train_data=imp.transform(train_data)
  File "/home/aurore/.local/lib/python2.7/site-packages/sklearn/preprocessing/imputation.py", line 388, in transform
    values = np.repeat(valid_statistics, n_missing)
  File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 343, in repeat
    return repeat(repeats, axis)
ValueError: a.shape[axis] != len(repeats)

I then tried this instead:

new_train_data = imp.fit_transform(train_data)

But now I get this error:

Traceback (most recent call last):
  File "./RandomForest.py", line 82, in <module>
    forest = forest.fit(train_data[0::,1::],train_data[0::,0])
  File "/home/aurore/.local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 224, in fit
    X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
  File "/home/aurore/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 283, in check_arrays
    _assert_all_finite(array)
  File "/home/aurore/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 43, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Is there some problem with the package? Can someone please help me? What does it mean?

Aurore Vaitinadapoule asked Oct 21 '22 02:10



1 Answer

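You fit the imputer on columns 1::, but then you apply it to all columns, so the array you transform has a different number of columns than the one the imputer was fitted on. That doesn't work. Fit and transform the same array instead: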

new_train_data = imp.fit_transform(train_data)
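
A fuller sketch of the same idea is below, under a few assumptions that are not in the question. The sample array is made up to mirror the question's layout (label in column 0, features after it). The string array is converted to float first so that 'NaN' becomes a real np.nan. The code uses SimpleImputer from sklearn.impute, which replaces the older sklearn.preprocessing.Imputer in newer scikit-learn releases. The forest settings (n_estimators, random_state) are arbitrary. Note also that the second traceback shows forest.fit being called on train_data rather than on the imputed array, so the NaNs are still there when the forest sees the data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Made-up sample in the same layout as the question:
# column 0 is the label, the remaining columns are features stored as strings.
train_data = np.array([
    ['1', 'NaN', 'NaN', '0.0127034', '0.0435092'],
    ['0', '0.648', '0.248', '0.0142176', '0.202707'],
    ['1', '0.357', '0.470', '0.0328121', '0.255039'],
    ['0', 'NaN', 'NaN', '0.0332604', '0.2857'],
])

# Convert the strings to floats so that 'NaN' becomes a real np.nan.
data = train_data.astype(float)
y = data[:, 0].astype(int)   # labels
X = data[:, 1:]              # features only

# Fit and transform the SAME columns, so the shapes always match.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imp.fit_transform(X)

# Train on the imputed array, not on the original one that still has NaNs.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_imputed, y)
```

Keeping the label column out of the imputer is a design choice here, not a requirement: it just avoids the labels being touched by the imputation step.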
Fred Foo answered Oct 22 '22 16:10
