
Scikit-learn categorisation: binomial logistic regression?

I have texts that are rated on a continuous scale from -100 to +100. I am trying to classify them as positive or negative.

How can you perform binomial logistic regression to get the probability that test data is -100 or +100?

The closest I have got is SGDClassifier(penalty='l2', alpha=1e-05, n_iter=10), but this doesn't give the same results as SPSS when I use binomial logistic regression to predict the probability of -100 and +100. So I'm guessing this is not the right function?

Zach asked May 17 '26 07:05


1 Answer

SGDClassifier provides access to several linear classifiers, all trained with stochastic gradient descent. It defaults to a linear support vector machine (hinge loss) unless you call it with a different loss function. loss='log' gives you a probabilistic logistic regression (recent scikit-learn releases spell this loss='log_loss').

See the documentation at: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

Alternatively, you could use sklearn.linear_model.LogisticRegression to classify your texts with a logistic regression.
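A minimal sketch of that route, again on made-up texts and ratings. The CountVectorizer step is my assumption about how the texts get turned into features. Also keep in mind that LogisticRegression applies L2 regularisation by default (C=1.0), which is one implementation detail that can shift the fitted probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: texts rated on the -100..+100 scale.
texts = [
    "great wonderful film", "awful boring film",
    "wonderful acting great plot", "boring plot awful acting",
    "great great wonderful", "awful awful boring",
]
ratings = np.array([80, -70, 95, -60, 100, -100])
labels = (ratings > 0).astype(int)  # 1 = positive, 0 = negative

# Bag-of-words features (assumed here; any vectoriser would do).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# C is the inverse regularisation strength; larger C means a weaker penalty.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X, labels)

# Reuse the fitted vocabulary for unseen texts.
new_X = vectorizer.transform(["great plot", "boring film"])
p_positive = clf.predict_proba(new_X)[:, 1]  # column 1 = class 1 (positive)
predicted = clf.predict(new_X)
print(p_positive, predicted)
```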

It's not clear to me that you will get exactly the same results as you do with SPSS due to differences in implementation. However, I would not expect to see statistically significant differences.

Edited to add:

My suspicion is that the 99% accuracy you're getting with the SPSS logistic regression is training set accuracy, while the 87% that you're seeing with the scikit-learn logistic regression is test set accuracy. I found this question on the Data Science Stack Exchange, where a different person is attempting an extremely similar problem and getting ~99% accuracy on training sets and 90% accuracy on test sets.

https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features

My recommended path forward is as follows: try several different basic classifiers in scikit-learn, including the standard logistic regression and a linear SVM, and also rerun the SPSS logistic regression several times with different train/test subsets of your data and compare the results. If you continue to see a large divergence across classifiers that can't be accounted for by ensuring similar train/test data splits, then post the results that you're seeing into your question, and we can move forward from there.
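On the scikit-learn side, the comparison could be sketched roughly like this. The data is a synthetic stand-in from make_classification (not your texts), and the five 75/25 splits are an arbitrary choice; the point is just to report training and test accuracy side by side for each classifier on identical splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.svm import LinearSVC

# Synthetic stand-in for the vectorised texts (hypothetical data).
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10, random_state=0)

# Five random 75/25 train/test splits; the fixed seed makes every
# classifier see exactly the same splits.
splits = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(max_iter=10000),
}

results = {}
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=splits, scoring="accuracy",
                            return_train_score=True)
    # Mean accuracy over the splits: (training set, test set).
    results[name] = (scores["train_score"].mean(),
                     scores["test_score"].mean())
    print(f"{name}: train={results[name][0]:.3f} test={results[name][1]:.3f}")
```

A large gap between the train and test columns for the same classifier points to overfitting rather than to a difference between the tools.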

Good luck!

brentlance answered May 19 '26 22:05