
Trouble understanding output from scikit random forest

Say I have a dataset like this:

5.9;0.645;0.12;2;0.075;32;44;0.99547;3.57;0.71;10.2;5
6;0.31;0.47;3.6;0.067;18;42;0.99549;3.39;0.66;11;6

where the first 11 columns are features (acidity, chlorides, etc.) and the last column is the rating given to the item (e.g. 5 or 6).

The model is trained like this:

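from sklearn.ensemble import RandomForestClassifier

# Last column is the rating (target); the first 11 columns are the features.
target = [x[11] for x in dataset]
train = [x[0:11] for x in dataset]

rf = RandomForestClassifier(n_estimators=120, n_jobs=-1)
rf.fit(train, target)

# predict_proba returns one row of class probabilities per test sample.
predictions = rf.predict_proba(testdataset)
print(predictions[0])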

which prints something like

[ 0.          0.01666667  0.98333333  0.          0.          0.        ]

Now, why does it not output a single classification, e.g. a 5 or a 6 rating?

The documentation says "The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest", which I'm having trouble understanding.

If I instead use

print(rf.predict(testdataset[-1]))
[ 6.  6.  6.  6.  6.  6.  6.  6.  6.  6.  6.]

it prints something closer to what you'd expect (at least it looks like ratings), but I still don't understand why there is one prediction per feature rather than a single prediction that takes all features into account.

asked Jan 08 '13 02:01 by scc


2 Answers

In addition to Diego's answer:

RandomForestClassifier is a classifier: it predicts a class assignment among a discrete set of classes and assumes no ordering between the class labels.

If you want to output a continuous, floating-point rating, you should try a regression model such as RandomForestRegressor instead.

You might have to clamp the output to the range [0, 6], as there is no guarantee that the model will not output predictions such as 6.2.
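As a rough sketch of the regression approach: the data below is synthetic, and the variable names and the [0, 6] bounds are assumptions for illustration, not your actual dataset. It fits a RandomForestRegressor and clamps the predictions with np.clip:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 11 features, ratings from 0 to 6.
rng = np.random.RandomState(0)
X = rng.rand(200, 11)
y = rng.randint(0, 7, size=200).astype(float)

reg = RandomForestRegressor(n_estimators=120, random_state=0)
reg.fit(X, y)

# Continuous predictions, one float per sample.
raw = reg.predict(X[:5])

# Clamp to the assumed valid rating range.
clipped = np.clip(raw, 0.0, 6.0)
print(clipped)
```

If you need integer ratings afterwards, rounding the clipped values is one option, at the cost of discarding the fractional information.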

Edit, to answer your second point: the predict method expects a list of samples, so in your case you should pass it a list containing one sample. Try:

print(rf.predict([testdataset[-1]]))

or alternatively:

print(rf.predict(testdataset[-1:]))

I wonder why you don't get an error in that case.
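To illustrate the single-sample case, here is a minimal sketch on synthetic data (the data and variable names are assumptions). The point is only that the input to predict is 2-D, with shape (n_samples, n_features), even when there is a single sample:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 11 features, ratings 5 or 6.
rng = np.random.RandomState(0)
X = rng.rand(100, 11)
y = rng.randint(5, 7, size=100)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

sample = X[-1]                 # 1-D, shape (11,): one sample's features
batch = sample.reshape(1, -1)  # 2-D, shape (1, 11): a batch of one sample

pred = clf.predict(batch)      # one prediction, shape (1,)
same = clf.predict(X[-1:])     # slicing keeps the 2-D shape as well
print(pred, same)
```

Recent scikit-learn releases reject a 1-D input with a ValueError instead of silently treating each value as a separate sample.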

Edit: the output does not really make sense. What are the shapes of your datasets?

>>> import numpy as np
>>> print(np.asarray(train).shape)

>>> print(np.asarray(target).shape)

>>> print(np.asarray(testdataset).shape)

answered Sep 28 '22 07:09 by ogrisel


From the docs, predict_proba returns:

p : array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1. The class probabilities of the input samples. Classes are ordered by arithmetical order.

The key here is the last sentence: "Classes are ordered by arithmetical order". The output has six entries, so my guess is that your training set contains six distinct classes: one below 5, to which predict_proba assigned a probability of zero; classes 5 and 6, with probabilities 0.01666667 and 0.98333333 respectively; and three more classes, all above 6, also with probability zero.
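Rather than guessing, you can read the column-to-label mapping from the fitted model's classes_ attribute. The sketch below uses synthetic data (the labels 3 to 8 and the variable names are assumptions). It also checks that the forest's probabilities are the mean of the per-tree probabilities, as the documentation says. With 120 trees, values such as 0.01666667 and 0.98333333 are consistent with 2 and 118 trees voting for those two classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 11 features, six classes labelled 3..8.
rng = np.random.RandomState(0)
X = rng.rand(300, 11)
y = rng.randint(3, 9, size=300)

clf = RandomForestClassifier(n_estimators=120, random_state=0).fit(X, y)

# Column i of predict_proba corresponds to the label clf.classes_[i].
proba = clf.predict_proba(X[:1])[0]
for label, p in zip(clf.classes_, proba):
    print(label, round(p, 4))

# The forest's probabilities are the mean over the individual trees.
tree_mean = np.mean(
    [tree.predict_proba(X[:1])[0] for tree in clf.estimators_], axis=0
)
matches_mean = np.allclose(proba, tree_mean)

# predict() returns the label with the highest mean probability.
best = clf.classes_[np.argmax(proba)]
print(best, clf.predict(X[:1])[0], matches_mean)
```

So predict_proba gives you the full distribution over classes for each sample, while predict collapses it to the single most probable label.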

answered Sep 28 '22 08:09 by Diego