Use of scikit Random Forest sample_weights

Tags: python, scikit-learn, random-forest

I've been trying to figure out scikit's Random Forest sample_weight use and I cannot explain some of the results I'm seeing. Fundamentally I need it to balance a classification problem with unbalanced classes.

In particular, I was expecting that if I used a sample_weight array of all 1's I would get the same result as with sample_weight=None. Additionally, I was expecting that any array of equal weights (i.e. all 1s, or all 10s, or all 0.8s...) would give the same result. Perhaps my intuition about weights is wrong in this case.

Here's the code:

import numpy as np
from sklearn import ensemble, metrics, datasets

# Create a synthetic dataset with unbalanced classes (about 90% in class 0)
X, y = datasets.make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9],
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=0)

# No random_state is set on the model, so every fit draws new random numbers
model = ensemble.RandomForestClassifier()

w0 = 1  # weight associated with 0's
w1 = 1  # weight associated with 1's

# I should split train and validation, but for the sake of understanding sample_weight I'll skip this step
model.fit(X, y, sample_weight=np.array([w0 if r == 0 else w1 for r in y]))
preds = model.predict(X)
probas = model.predict_proba(X)
ACC = metrics.accuracy_score(y, preds)
precision, recall, pr_thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, roc_thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y, preds)  # rows are true labels, columns are predictions
print("ACCURACY:", ACC)
print("ROC:", ROC)
print("F1 Score:", metrics.f1_score(y, preds))
print("TP:", cm[1, 1], cm[1, 1] / cm.sum())
print("FP:", cm[0, 1], cm[0, 1] / cm.sum())
print("Precision:", cm[1, 1] / (cm[1, 1] + cm[0, 1]))  # TP / (TP + FP)
print("Recall:", cm[1, 1] / (cm[1, 1] + cm[1, 0]))  # TP / (TP + FN)

With w0=w1=1 I get, for instance, F1=0.9456.
With w0=w1=10 I get, for instance, F1=0.9569.
With sample_weight=None I get F1=0.9474.

asked Jan 28 '14 22:01

ADJ

1 Answer

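With the Random Forest algorithm there is, as the name implies, some randomness involved.

You are getting different F1 scores because the Random Forest algorithm trains each decision tree on a randomly drawn (bootstrap) sample of your data, considers a random subset of features at each split, and then averages across all of the trees. Your model is created without a fixed random_state, so every call to fit draws different random numbers. I am not surprised, therefore, that you get similar (but non-identical) F1 scores for each of your runs, whatever the weights are.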
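To separate the effect of the weights from the effect of the random seed, one option is to fix random_state and compare fits directly. The following is a minimal sketch, not the asker's exact setup: the helper name fit_proba, the smaller dataset, and n_estimators=20 are assumptions chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Smaller unbalanced dataset than in the question, for speed (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                           weights=[0.9], random_state=0)

def fit_proba(sample_weight, seed):
    """Hypothetical helper: fit a forest with a fixed seed and return class probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X, y, sample_weight=sample_weight)
    return model.predict_proba(X)

p_none = fit_proba(None, seed=0)                 # no weights
p_ones = fit_proba(np.ones(len(y)), seed=0)      # all weights equal to 1, same seed
p_other_seed = fit_proba(None, seed=1)           # no weights, different seed

# Same seed: compare unweighted fit against all-ones weights
print("None vs ones, same seed:", np.allclose(p_none, p_ones))
# Different seed: any gap here comes from randomness alone, not from weights
print("max difference across seeds:", np.abs(p_none - p_other_seed).max())
```

If the first comparison agrees while the second shows a gap, the differences in the question are most likely run-to-run noise rather than an effect of sample_weight.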

I have tried balancing the weights before. You may want to try weighting each class by the size of the other class in the population. For example, suppose you have two classes like this:

Class A: 5 members
Class B: 2 members

You may wish to balance the weights by assigning 2/7 to each of Class A's members and 5/7 to each of Class B's members, so that both classes carry the same total weight (5 × 2/7 = 2 × 5/7 = 10/7). That's just an idea for a starting point, though. How you weight your classes will depend on the problem you have.
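The weighting above can be sketched with NumPy as follows. The label encoding (0 for Class A, 1 for Class B) is an assumption made for illustration. For comparison, the sketch also calls scikit-learn's compute_sample_weight with "balanced", which uses a different scale (n_samples / (n_classes * class_count)) but, for two classes, yields the same ratio between the class weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Class A encoded as 0 (5 members), Class B encoded as 1 (2 members) (assumed encoding)
y = np.array([0, 0, 0, 0, 0, 1, 1])

counts = np.bincount(y)                       # members per class: [5, 2]
# Weight each class by the share of samples NOT in it: A gets 2/7, B gets 5/7
per_class = (counts.sum() - counts) / counts.sum()
sample_weight = per_class[y]                  # one weight per sample

# Total weight carried by each class
total_a = sample_weight[y == 0].sum()
total_b = sample_weight[y == 1].sum()
print("per-class weights:", per_class)
print("class totals:", total_a, total_b)

# scikit-learn's built-in "balanced" heuristic, for comparison
balanced = compute_sample_weight("balanced", y)
print("balanced weights:", balanced)
```

Either array can be passed as sample_weight to fit. RandomForestClassifier also accepts class_weight="balanced" in recent scikit-learn versions, which applies the same heuristic without building the array by hand.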

answered Sep 24 '22 22:09

ericmjl
