Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use of scikit Random Forest sample_weights

I've been trying to figure out scikit's Random Forest sample_weight use and I cannot explain some of the results I'm seeing. Fundamentally I need it to balance a classification problem with unbalanced classes.

In particular, I was expecting that if I used a sample_weights array of all 1's I would get the same result as w sample_weights=None. Additionally, I was expeting that any array of equal weights (i.e. all 1s, or all 10s or all 0.8s...) would provide the same result. Perhaps my intuition of weights is wrong in this case.

Here's the code:

import numpy as np
from sklearn import ensemble,metrics, cross_validation, datasets

#create a synthetic dataset with unbalanced classes
X,y = datasets.make_classification(
n_samples=10000, 
n_features=20, 
n_informative=4, 
n_redundant=2, 
n_repeated=0, 
n_classes=2, 
n_clusters_per_class=2, 
weights=[0.9],
flip_y=0.01,
class_sep=1.0, 
hypercube=True, 
shift=0.0, 
scale=1.0, 
shuffle=True, 
random_state=0)

model = ensemble.RandomForestClassifier()

w0=1 #weight associated to 0's
w1=1 #weight associated to 1's

#I should split train and validation but for the sake of understanding sample_weights I'll skip this step
model.fit(X, y,sample_weight=np.array([w0 if r==0 else w1 for r in y]))    
preds = model.predict(X)
probas = model.predict_proba(X)
ACC = metrics.accuracy_score(y,preds)
precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y,preds)
print "ACCURACY:", ACC
print "ROC:", ROC
print "F1 Score:", metrics.f1_score(y,preds)
print "TP:", cm[1,1], cm[1,1]/(cm.sum()+0.0)
print "FP:", cm[0,1], cm[0,1]/(cm.sum()+0.0)
print "Precision:", cm[1,1]/(cm[1,1]+cm[0,1]*1.1)
print "Recall:", cm[1,1]/(cm[1,1]+cm[1,0]*1.1)
  • With w0=w1=1 I get, for instance, F1=0.9456.
  • With w0=w1=10 I get, for instance, F1=0.9569.
  • With sample_weights=None I get F1=0.9474.
like image 303
ADJ Avatar asked Jan 28 '14 22:01

ADJ


People also ask

What is random forest classifier used for?

From there, the random forest classifier can be used to solve for regression or classification problems. The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is comprised of a data sample drawn from a training set with replacement, called the bootstrap sample.

What is Sklearn random forest?

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

What is the use of random forest algorithm in machine learning?

A Random Forest Algorithm is a supervised machine learning algorithm which is extremely popular and is used for Classification and Regression problems in Machine Learning. We know that a forest comprises numerous trees, and the more trees more it will be robust.


1 Answers

With the Random Forest algorithm, there is, as the name implies, some "Random"ness to it.

You are getting different F1 score because the Random Forest Algorithm (RFA) is using a subset of your data to generate the decision trees, and then averaging across all of your trees. I am not surprised, therefore, that you have similar (but non-identical) F1 scores for each of your runs.

I have tried balancing the weights before. You may want to try balancing the weights by the size of each class in the population. For example, if you were to have two classes as such:

Class A: 5 members
Class B: 2 members

You may wish to balance the weights by assigning 2/7 for each of Class A's members and 5/7 for each of Class B's members. That's just an idea as a starting place, though. How you weight your classes will depend on the problem you have.

like image 65
ericmjl Avatar answered Sep 24 '22 22:09

ericmjl