Scikit-learn categorisation: binomial log regression?

Question

I have texts that are rated on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.

How can I perform binomial logistic regression to get the probability that a test text is -100 or +100?

The closest I have got is SGDClassifier(penalty='l2', alpha=1e-05, n_iter=10), but this doesn't give the same results as SPSS when I use binomial logistic regression to predict the probability of -100 and +100. So I'm guessing this is not the right function?

brentlance · Accepted Answer

SGDClassifier provides access to several linear classifiers, all trained with stochastic gradient descent. It defaults to a linear support vector machine (loss='hinge') unless you call it with a different loss function. loss='log' gives a probabilistic logistic regression (the option is spelled loss='log_loss' in scikit-learn 1.1 and later).

See the documentation at: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
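As a rough sketch of that suggestion, the snippet below binarises ratings by sign and fits SGDClassifier with the logistic loss so that predict_proba is available. The texts, ratings, the TF-IDF features, and the sign-based labelling are all made up for illustration; none of them come from the question.

```python
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: ratings on the -100..+100 scale.
texts = [
    "great film, loved it", "wonderful and moving story",
    "brilliant acting and a great script", "loved every minute",
    "awful, a waste of time", "terrible plot and boring",
    "boring, awful acting", "a dull and terrible film",
]
ratings = [80, 65, 90, 70, -90, -70, -60, -85]

# Assumption: a text counts as positive when its rating is above zero.
labels = [1 if r > 0 else 0 for r in ratings]

# The logistic loss is named "log_loss" from scikit-learn 1.1 on, "log" before.
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
log_loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

clf = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss=log_loss_name, penalty="l2", alpha=1e-5, random_state=0),
)
clf.fit(texts, labels)

# One row per text; columns are P(negative), P(positive).
proba = clf.predict_proba(["boring and awful", "loved it"])
print(proba)
```

With the default hinge loss the same pipeline would still classify, but predict_proba would not be available, which is one reason the loss argument matters here.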

Alternatively, you could use sklearn.linear_model.LogisticRegression to classify your texts with a logistic regression.
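A minimal sketch of the LogisticRegression route follows. The toy texts and the bag-of-words features are assumptions for illustration. One implementation detail worth keeping in mind when comparing against another package: LogisticRegression applies L2 regularisation by default (C=1.0), so if the other tool fits an unpenalised model (an assumption here, not something checked against SPSS), the coefficients and probabilities will not match exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, labelled 1 (positive rating) or 0 (negative rating).
train_texts = [
    "great film, loved it", "wonderful and moving story",
    "brilliant acting and a great script", "loved every minute",
    "awful, a waste of time", "terrible plot and boring",
    "boring, awful acting", "a dull and terrible film",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Default settings use an L2 penalty with C=1.0; a larger C weakens the penalty.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# classes_ gives the column order of predict_proba.
classes = list(model.named_steps["logisticregression"].classes_)
proba = model.predict_proba(["a great and moving film", "dull and boring"])
p_positive = proba[:, classes.index(1)]
print(p_positive)
```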

Because the implementations differ, I would not expect exactly the same results as you get with SPSS. However, I would not expect to see statistically significant differences either.

Edited to add:

My suspicion is that the 99% accuracy you're getting with the SPSS logistic regression is training set accuracy, while the 87% that you're seeing with the scikit-learn logistic regression is test set accuracy. I found this question on the Data Science Stack Exchange where a different person is attempting an extremely similar problem and getting ~99% accuracy on training sets and 90% accuracy on test sets.

https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features
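To illustrate the training-versus-test distinction, here is a small sketch on synthetic data. The dataset is a made-up stand-in for a wide bag-of-words matrix (more features than samples), not your texts, and the exact numbers it prints will differ from yours.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, 500 features, only a few of them informative.
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=10, random_state=0
)

# Hold out a quarter of the data that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the fitting data is optimistic; held-out accuracy is the fair number.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

If one tool reports the first number and the other reports the second, a gap between them says nothing about which implementation is better.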

My recommended path forward is as follows: Try several different basic classifiers in scikit-learn, including the standard logistic regression and a linear SVM, and also rerun the SPSS logistic regression several times with different train/test subsets of your data and compare the results. If you continue to see a large divergence across classifiers that can't be accounted for by ensuring similar train/test data splits, then post the results that you're seeing into your question, and we can move forward from there.
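The scikit-learn half of that comparison could be sketched like this: score each classifier on the same set of repeated random train/test splits so that differences between them are not an artefact of one particular split. The synthetic data, the choice of five splits, and the 25% test size are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorised texts and their positive/negative labels.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10, random_state=0
)

# Fixed random_state: every classifier is evaluated on identical splits.
splits = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(max_iter=10000),
}

# Test-set accuracy per split for each classifier.
results = {
    name: cross_val_score(estimator, X, y, cv=splits)
    for name, estimator in candidates.items()
}

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The spread across splits gives a sense of how much variation is normal, which helps judge whether a gap against SPSS is large enough to worry about.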

Good luck!

Tags:

python

classification

scikit-learn

Zach
