Say I have a dataset like this:
5.9;0.645;0.12;2;0.075;32;44;0.99547;3.57;0.71;10.2;5
6;0.31;0.47;3.6;0.067;18;42;0.99549;3.39;0.66;11;6
where the first 11 columns are features (acidity, chlorides, etc.) and the last column is the rating given to the item (e.g. 5 or 6).
The model is trained thus:
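from sklearn.ensemble import RandomForestClassifier

# last column is the rating (class label), the first 11 columns are the features
target = [x[11] for x in dataset]
train = [x[0:11] for x in dataset]
rf = RandomForestClassifier(n_estimators=120, n_jobs=-1)
rf.fit(train, target)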
predictions = rf.predict_proba(testdataset)
print predictions[0]
which prints something like
[ 0. 0.01666667 0.98333333 0. 0. 0. ]
Now, why does it not output a single classification, e.g. a rating of 5 or 6?
The documentation says "The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest" which I'm having trouble understanding.
If you use
print rf.predict(testdataset[-1])
[ 6. 6. 6. 6. 6. 6. 6. 6. 6. 6. 6.]
It prints something more like you'd expect - at least it looks like ratings - but I still don't understand why there's a prediction per feature and not a single prediction taking into account all features?
In addition to Diego's answer:
RandomForestClassifier is a classifier: it predicts a class assignment among a discrete set of classes, with no ordering between the class labels.
If you want a continuous, floating-point rating as output, you should try a regression model such as RandomForestRegressor instead.
You might have to clamp the output to the range [0, 6], as there is no guarantee the model will not output predictions such as 6.2, for instance.
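The regression suggestion above could look roughly like the sketch below. The arrays are made-up random data standing in for the real dataset, and the names reg, lo, hi and clamped are illustrative only; the clamping bounds are taken from the training targets here, which is an assumption rather than anything stated in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up toy data (assumption): 11 features per row, integer ratings as targets.
rng = np.random.RandomState(0)
train = rng.rand(200, 11)
target = rng.randint(3, 9, size=200).astype(float)
testdataset = rng.rand(5, 11)

reg = RandomForestRegressor(n_estimators=120, random_state=0)
reg.fit(train, target)

# One floating-point rating per test sample.
raw = reg.predict(testdataset)

# Clamp to the allowed rating range; the bounds here are illustrative.
lo, hi = target.min(), target.max()
clamped = np.clip(raw, lo, hi)
print(clamped)
```

If integer ratings are needed, np.rint(clamped) rounds each value to the nearest whole rating.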
Edit: to answer your second point, the predict method expects a list of samples. Hence you should provide it with a list containing one sample in your case. Try:
print rf.predict([testdataset[-1]])
or alternatively:
print rf.predict(testdataset[-1:])
I wonder why you don't get an error in that case.
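To make the "list of one sample" point concrete, here is a small sketch with NumPy arrays. The data is random and purely illustrative, and batch_of_one is a hypothetical name. As far as I know, recent scikit-learn versions reject a bare 1-D sample with a ValueError, so wrapping or reshaping it is needed there.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up toy data (assumption): 11 features per row, ratings 5 or 6.
rng = np.random.RandomState(0)
train = rng.rand(200, 11)
target = rng.randint(5, 7, size=200)
testdataset = rng.rand(5, 11)

rf = RandomForestClassifier(n_estimators=120, random_state=0)
rf.fit(train, target)

last = testdataset[-1]               # shape (11,): one sample, 11 features
batch_of_one = last.reshape(1, -1)   # shape (1, 11): a batch holding one sample

single_prediction = rf.predict(batch_of_one)
print(single_prediction.shape)       # (1,): one prediction for the one sample

# Slicing keeps the 2-D shape, so this is equivalent.
sliced_prediction = rf.predict(testdataset[-1:])
```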
Edit: the output does not really make sense. What are the shapes of your datasets?
>>> print np.asarray(train).shape
>>> print np.asarray(target).shape
>>> print np.asarray(testdataset).shape
From the docs, predict_proba returns:
p : array of shape = [n_samples, n_classes], or a list of n_outputs such arrays if n_outputs > 1. The class probabilities of the input samples. Classes are ordered by arithmetical order.
The key here is the last phrase "Classes are ordered by arithmetical order".
My guess is that some of your training samples have a class lower than 5, to which predict_proba assigned a probability of zero, while classes 5 and 6 received probabilities 0.01666667 and 0.98333333, respectively. The other three classes, all greater than 6, also received a probability of zero.
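Rather than guessing which column belongs to which rating, the fitted classifier's classes_ attribute lists the labels in column order. The sketch below uses made-up random data with six rating classes (an assumption, chosen to match the six columns in the question's output). It also checks the documentation sentence quoted in the question: the forest's probabilities are the mean of the per-tree probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up toy data (assumption): 11 features per row, six possible ratings.
rng = np.random.RandomState(0)
train = rng.rand(300, 11)
target = rng.choice([3, 4, 5, 6, 7, 8], size=300)
testdataset = rng.rand(5, 11)

rf = RandomForestClassifier(n_estimators=120, random_state=0)
rf.fit(train, target)

proba = rf.predict_proba(testdataset)   # shape (n_samples, n_classes)

# Column i of proba is the probability of the label rf.classes_[i].
for label, p in zip(rf.classes_, proba[0]):
    print(label, p)

# The single rating returned by predict is the label of the most probable column.
best = rf.classes_[np.argmax(proba, axis=1)]

# Averaging the individual trees' probabilities reproduces predict_proba.
tree_mean = np.mean([t.predict_proba(testdataset) for t in rf.estimators_], axis=0)
```

So predict_proba gives one probability per class for each sample, while predict collapses that row into a single rating.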