
Different versions of sklearn give quite different training results

We upgraded our sklearn from the old 0.13-git to 0.14.1, and found that the behavior of our logistic regression classifier changed quite a bit. The two classifiers, trained on the same data, have different coefficients, and thus often give different classification results.

As an experiment I used 5 data points (high dimensional) to train the LR classifier, and the results are:

0.13-git:

```python
>>> clf.fit(data_test.data, y)
LogisticRegression(C=10, class_weight='auto', dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', tol=0.0001)
>>> np.sort(clf.coef_)
array([[-0.12442518, -0.11137502, -0.11137502, ...,  0.05428562,
         0.07329358,  0.08178794]])
```

0.14.1:

```python
>>> clf1.fit(data_test.data, y)
LogisticRegression(C=10, class_weight='auto', dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
>>> np.sort(clf1.coef_)
array([[-0.11702073, -0.10505662, -0.10505662, ...,  0.05630517,
         0.07651478,  0.08534311]])
```

I would say the difference is quite big, on the order of 10^(-2). Obviously the data I used here is not ideal, because the number of features is much larger than the number of samples. However, this is often the case in practice as well. Does it have something to do with feature selection? How can I make the results the same as before? I understand the new results are not necessarily worse than the old ones, but the focus now is to make them as consistent as possible. Thanks.
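As a first sanity check before comparing across versions, it helps to confirm that a single sklearn version gives reproducible coefficients on the same data. The sketch below uses synthetic stand-in data (5 samples, 50 features, mimicking the n_features >> n_samples setup described above); the array shapes and parameter values are assumptions for illustration, not the original data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in: 5 samples, 50 features (n_features >> n_samples).
rng = np.random.RandomState(0)
X = rng.randn(5, 50)
y = np.array([0, 0, 1, 1, 1])

# Fixing random_state removes run-to-run nondeterminism within a single
# sklearn version; it cannot undo a deliberate cross-version change.
clf_a = LogisticRegression(C=10, penalty="l2", tol=1e-4, random_state=0).fit(X, y)
clf_b = LogisticRegression(C=10, penalty="l2", tol=1e-4, random_state=0).fit(X, y)

# Same version, same seed: coefficients agree to numerical precision.
print(np.allclose(clf_a.coef_, clf_b.coef_))
```

If this prints True on each version separately but the coefficients still differ between versions, the discrepancy comes from a change in the library itself rather than from training noise.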

asked Apr 18 '15 by ymeng


1 Answer

From the release 0.13 changelog:

Fixed class_weight support in svm.LinearSVC and linear_model.LogisticRegression by Andreas Müller. The meaning of class_weight was reversed as erroneously higher weight meant less positives of a given class in earlier releases.

However, this entry appears in the changelog for version 0.13 itself, not a later release. Since you were running 0.13-git, you may have been using a pre-release of 0.13 that predated this fix; that would explain why upgrading to 0.14.1 changed your results.

Looking at your coefficients, they are slightly smaller in magnitude in the new version, which is consistent with the fix reversing the meaning of class_weight.

You might want to adjust the parameters of your new LogisticRegression(...), in particular class_weight, and see whether you can get closer to the old behavior.
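One way to probe the class_weight hypothesis is to compute the automatic weights by hand and also fit with their reciprocals, which is roughly what a reversed weighting would apply. This is an experiment under stated assumptions, not a documented equivalence; the data, weights, and C value are illustrative (note that the old 'auto' option corresponds to 'balanced' in modern sklearn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: 15 samples of class 0, 5 of class 1.
rng = np.random.RandomState(0)
X = rng.randn(20, 5)
y = np.array([0] * 15 + [1] * 5)

# What class_weight='balanced' computes: n_samples / (n_classes * count(c)).
classes, counts = np.unique(y, return_counts=True)
balanced = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}

# If the old release effectively applied the reverse weighting, passing the
# reciprocal weights explicitly is one way to approximate it.
inverted = {c: 1.0 / w for c, w in balanced.items()}

clf_new = LogisticRegression(C=10, class_weight=balanced).fit(X, y)
clf_old_like = LogisticRegression(C=10, class_weight=inverted).fit(X, y)

# Both binary models expose coefficients of shape (1, n_features).
print(clf_new.coef_.shape, clf_old_like.coef_.shape)  # (1, 5) (1, 5)
```

Comparing clf_old_like.coef_ against the coefficients saved from the old version would show whether the reversed weighting accounts for the difference you observed.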

answered Sep 29 '22 by Guillaume Chevalier