I'm using a random forest to classify my data, but the class probabilities it produces always look like 0, 0.1, 0.2, ..., 1, matching those values to at least 5 digits. Is this a statistics problem or a software problem? I'm using RandomForestClassifier from scikit-learn's ensemble module with Python 2.7.3 on Mac OS X 10.7.5. My data looks something like this:
y x1 x2 x3 x4...
0 23 4 0
1 102 2 0
1 12 17 1
The response variable, y, is binary. There are 15 features, all real or integer valued, some of which are binary. I have about 2000 training points and 500 test points. I set the number of trees to 500 and the number of features to try per tree to 8, and use the defaults for everything else. After training the model, I generate the probabilities using the "predict_proba" function and get results like 0.90000000000000002 or 0.10000000000000001.
I thought this problem might be caused by a particular variable, so I trained the model on just one variable at a time, repeating this for five variables. The probabilities for each variable alone have normal-looking values like 0.5532. When I use two variables together, a few values like 0.70000 start to appear. When I use even more variables, a larger fraction of the values look like 0.700000.
Is this a statistics or software problem? NumPy passed its tests (numpy.test()), but scipy.test() and sklearn.test() both failed. I've used scikit-learn in the past with failing tests and without this problem. I know that I should fix the packages, but I've spent 20 hours installing from source, then from binary packages, then reading over 30 webpages about how other people installed them or what bugs they hit. When they say installation is easy, I don't see them testing the packages. Thanks.
The default number of trees built by sklearn's random forest is 10 (this holds for scikit-learn versions before 0.22; later versions default to 100). It seems possible that you're not actually changing that default: with exactly 10 trees in the forest, this is what the output would look like. The probability is the fraction of trees voting for class 1, so the values will be 0, .1, .2, ..., 1.
Can you check the parameters assigned to see whether it's actually building 500 trees?
>>> import sklearn.ensemble
>>> rf = sklearn.ensemble.RandomForestClassifier()
>>> rf.n_estimators
10
>>> rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
>>> rf.n_estimators
500
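To see why the tree count shows up in the probabilities, here is a minimal standard-library sketch. It does not call scikit-learn. It assumes each tree casts a hard 0/1 vote, which, as far as I know, is what averaging per-tree probabilities reduces to when trees are grown until their leaves are pure. The helpers forest_probability and simulate_probabilities are hypothetical names made up for this illustration, and the votes are random rather than learned.

```python
import random


def forest_probability(votes):
    """Fraction of trees voting for class 1 (each vote is 0 or 1)."""
    return sum(votes) / float(len(votes))


def simulate_probabilities(n_trees, n_samples, seed=0):
    """Simulate per-sample forest probabilities from random 0/1 tree votes.

    This is an illustration only: real trees vote based on the data,
    but the set of values the average can take is the same.
    """
    rng = random.Random(seed)
    return [
        forest_probability([rng.randint(0, 1) for _ in range(n_trees)])
        for _ in range(n_samples)
    ]


# With 10 trees, every probability is a multiple of 1/10.
probs_10 = simulate_probabilities(n_trees=10, n_samples=200)
distinct_10 = sorted(set(probs_10))

# With 500 trees, the step shrinks to 1/500 = 0.002.
probs_500 = simulate_probabilities(n_trees=500, n_samples=200)

# The trailing digits are ordinary floating-point representation:
# 0.9 and 0.1 have no exact binary form, so printing 17 significant
# digits exposes the rounding error.
nine_tenths = "%.17g" % 0.9
one_tenth = "%.17g" % 0.1

print(distinct_10)
print("%s %s" % (nine_tenths, one_tenth))
```

If the forest really had 500 trees, you would expect values on a grid of 0.002 rather than 0.1. The 0.90000000000000002 style of output is just how older Python versions print the floats 0.9 and 0.1, not a sign of a broken install.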