predict_proba or decision_function as estimator "confidence"

I'm training a LogisticRegression classifier in scikit-learn. The features are (mostly) categorical, and so are the labels, so I use a DictVectorizer and a LabelEncoder, respectively, to encode them.

Training is fairly straightforward, but I'm having problems at test time. The simple option is to call the trained model's "predict" method and get the predicted label. However, the processing I do afterwards needs the probability of each possible label (class) for each instance, so I decided to use "predict_proba". The problem is that I get different results for the same test instance depending on whether it is passed alone or together with other instances.

The following code reproduces the problem.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder


X_real = [{'head': u'n\xe3o', 'dep_rel': u'ADVL'}, 
          {'head': u'v\xe3o', 'dep_rel': u'ACC'}, 
          {'head': u'empresa', 'dep_rel': u'SUBJ'}, 
          {'head': u'era', 'dep_rel': u'ACC'}, 
          {'head': u't\xeam', 'dep_rel': u'ACC'}, 
          {'head': u'import\xe2ncia', 'dep_rel': u'PIV'}, 
          {'head': u'balan\xe7o', 'dep_rel': u'SUBJ'}, 
          {'head': u'ocupam', 'dep_rel': u'ACC'}, 
          {'head': u'acesso', 'dep_rel': u'PRED'}, 
          {'head': u'elas', 'dep_rel': u'SUBJ'}, 
          {'head': u'assinaram', 'dep_rel': u'ACC'}, 
          {'head': u'agredido', 'dep_rel': u'SUBJ'}, 
          {'head': u'pol\xedcia', 'dep_rel': u'ADVL'}, 
          {'head': u'se', 'dep_rel': u'ACC'}] 
y_real = [u'AM-NEG', u'A1', u'A0', u'A1', u'A1', u'A1', u'A0', u'A1', u'AM-ADV', u'A0', u'A1', u'A0', u'A2', u'A1']

feat_encoder = DictVectorizer()
feat_encoder.fit(X_real)

label_encoder = LabelEncoder()
label_encoder.fit(y_real)

model = LogisticRegression()
model.fit(feat_encoder.transform(X_real), label_encoder.transform(y_real))

# print() with a single argument behaves the same in Python 2 and Python 3
print("Test 1...")
X_test1 = [{'head': u'governo', 'dep_rel': u'SUBJ'}]
X_test1_encoded = feat_encoder.transform(X_test1)
print("Features Encoded")
print(X_test1_encoded)
print("Shape")
print(X_test1_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test1_encoded))
print("predict_proba:")
print(model.predict_proba(X_test1_encoded))

print("Test 2...")
X_test2 = [{'head': u'governo', 'dep_rel': u'SUBJ'},
           {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'},
           {'head': u'configuram', 'dep_rel': u'ACC'}]

X_test2_encoded = feat_encoder.transform(X_test2)
print("Features Encoded")
print(X_test2_encoded)
print("Shape")
print(X_test2_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test2_encoded))
print("predict_proba:")
print(model.predict_proba(X_test2_encoded))


print("Test 3...")
X_test3 = [{'head': u'governo', 'dep_rel': u'SUBJ'},
           {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'},
           {'head': u'configuram', 'dep_rel': u'ACC'},
           {'head': u'configuram', 'dep_rel': u'ACC'}]

X_test3_encoded = feat_encoder.transform(X_test3)
print("Features Encoded")
print(X_test3_encoded)
print("Shape")
print(X_test3_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test3_encoded))
print("predict_proba:")
print(model.predict_proba(X_test3_encoded))

This is the output obtained:

Test 1...
Features Encoded
  (0, 4)    1.0
Shape
(1, 19)
decision_function:
[[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]]
predict_proba:
[[ 1.  1.  1.  1.  1.]]
Test 2...
Features Encoded
  (0, 4)    1.0
  (1, 1)    1.0
  (2, 0)    1.0
Shape
(3, 19)
decision_function:
[[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]
 [-1.07370197 -0.69103629 -0.89306092 -1.51402163 -0.89306092]
 [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]]
predict_proba:
[[ 0.59710757  0.19486904  0.26065002  0.32612646  0.26065002]
 [ 0.23950111  0.24715931  0.51348452  0.3916478   0.51348452]
 [ 0.16339132  0.55797165  0.22586546  0.28222574  0.22586546]]
Test 3...
Features Encoded
  (0, 4)    1.0
  (1, 1)    1.0
  (2, 0)    1.0
  (3, 0)    1.0
Shape
(4, 19)
decision_function:
[[ 0.55372615 -1.02949707 -1.75474347 -1.73324726 -1.75474347]
 [-1.07370197 -0.69103629 -0.89306092 -1.51402163 -0.89306092]
 [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]
 [-1.55921001  1.11775556 -1.92080112 -1.90133404 -1.92080112]]
predict_proba:
[[ 0.5132474   0.12507868  0.21262531  0.25434403  0.21262531]
 [ 0.20586462  0.15864173  0.4188751   0.30544372  0.4188751 ]
 [ 0.14044399  0.3581398   0.1842498   0.22010613  0.1842498 ]
 [ 0.14044399  0.3581398   0.1842498   0.22010613  0.1842498 ]]

As can be seen, the "predict_proba" values for the instance in "X_test1" change when that same instance appears alongside others in "X_test2". Also, "X_test3" just repeats "X_test2" and adds one more instance (identical to the last one in "X_test2"), yet the probability values for all of them change. Why does this happen? I also find it really strange that ALL the probabilities for "X_test1" are 1. Shouldn't they sum to 1?

If I use "decision_function" instead of "predict_proba", I get the consistency I need. The problem is that some of the values are negative, and some of the positive ones are greater than 1, so they can't be read as probabilities.

So, what should I use? Why do the values of "predict_proba" change that way? Am I misunderstanding what those values mean?
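
For reference, "decision_function" returns unbounded scores rather than probabilities, which is why they can be negative or exceed 1. One common way to turn such scores into values that are positive and sum to 1 is a softmax. The sketch below applies it to the first row of the "decision_function" output shown above, using only the standard library. It is an illustration of the idea, not necessarily the formula that "predict_proba" used in the scikit-learn version from the question; the helper name `softmax` is my own.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to positive values that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the decision_function output from the question
scores = [0.55372615, -1.02949707, -1.75474347, -1.73324726, -1.75474347]

probs = softmax(scores)
print(probs)       # one value per class, each strictly between 0 and 1
print(sum(probs))  # sums to 1 up to floating-point error
```

The ranking of the classes is preserved: the class with the highest score also gets the highest probability.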

Thanks in advance for any help you could give me.

UPDATE

As suggested, I changed the code to also print the encoded "X_test1", "X_test2" and "X_test3", as well as their shapes. The encoding doesn't appear to be the problem, since it is consistent for the same instances across the test sets.

asked Nov 09 '12 04:11

feralvam

1 Answer

As indicated in the comments on the question, the behavior was caused by a bug in the version of scikit-learn I was using. Updating to the most recent stable version at the time, 0.12.1, solved the problem.
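
A quick way to confirm the fix after upgrading is to check the two properties the question expected: each row of "predict_proba" sums to 1, and an instance gets the same probabilities whether it is passed alone or in a batch. This is a minimal sketch, assuming Python 3 and a recent scikit-learn release. The toy training data is made up for illustration, and the Pipeline is a convenience rather than the original setup.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with the same feature layout as in the question
X_train = [
    {'head': 'empresa', 'dep_rel': 'SUBJ'},
    {'head': 'era', 'dep_rel': 'ACC'},
    {'head': 'acesso', 'dep_rel': 'PRED'},
    {'head': 'elas', 'dep_rel': 'SUBJ'},
    {'head': 'se', 'dep_rel': 'ACC'},
    {'head': 'dia', 'dep_rel': 'PRED'},
]
y_train = ['A0', 'A1', 'AM-ADV', 'A0', 'A1', 'AM-ADV']

# The classifier accepts string labels directly and stores them in classes_
model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

X_single = [{'head': 'governo', 'dep_rel': 'SUBJ'}]
X_batch = X_single + [{'head': 'configuram', 'dep_rel': 'ACC'}]

proba_single = model.predict_proba(X_single)
proba_batch = model.predict_proba(X_batch)

# Property 1: every row is a probability distribution
rows_sum_to_one = np.allclose(proba_batch.sum(axis=1), 1.0)
# Property 2: the first instance gets the same row alone or in a batch
batch_independent = np.allclose(proba_single[0], proba_batch[0])
print(rows_sum_to_one, batch_independent)

# Pair each probability with its label; columns follow model.classes_
print(dict(zip(model.classes_, proba_single[0])))
```

If either check prints False, the installed version is probably still affected and "predict_proba" should not be trusted as a confidence measure.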

answered Oct 21 '22 14:10

feralvam

Tags:

python

machine-learning

scikit-learn
