 

SKlearn SGD Partial Fit

What am I doing wrong here? I have a large data set that I want to fit incrementally with scikit-learn's SGDClassifier, using partial_fit.

I do the following:

from sklearn.linear_model import SGDClassifier
import pandas as pd

chunksize = 5
clf2 = SGDClassifier(loss='log', penalty="l2")

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    clf2.partial_fit(X, Y)

I get this error:

Traceback (most recent call last):
  File "/predict.py", line 48, in <module>
    sys.exit(0 if main() else 1)
  File "/predict.py", line 44, in main
    predict()
  File "/predict.py", line 38, in predict
    clf2.partial_fit(X, Y)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 512, in partial_fit
    coef_init=None, intercept_init=None)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 349, in _partial_fit
    _check_partial_fit_first_call(self, classes)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/utils/multiclass.py", line 297, in _check_partial_fit_first_call
    raise ValueError("classes must be passed on the first call "
ValueError: classes must be passed on the first call to partial_fit.
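The error does not depend on the CSV file. A minimal sketch that reproduces it, assuming a small made-up array in place of one chunk of train.csv (the data and the variable names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data standing in for one chunk of train.csv.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])

clf = SGDClassifier()

try:
    # First call without `classes`: the estimator has no classes_ yet.
    clf.partial_fit(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```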

Kabard asked Feb 09 '17 21:02




1 Answer

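The classifier does not know the set of class labels up front, so the first call to partial_fit must receive them through the classes argument, for example np.unique(target), where target is the full class column. Because you are reading the data in chunks, a single chunk may not contain every label, and passing a classes array on a later call that differs from the first one raises another ValueError. A safer variant is to collect the labels once from the whole label column (assuming that column alone fits in memory) and pass the same array on every call:

import numpy as np

# Read only the label column once to collect every class label.
classes = np.unique(pd.read_csv("train.csv", usecols=["clicked"])["clicked"])

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    # The same classes array is passed on every call; only the first call requires it.
    clf2.partial_fit(X, Y, classes=classes)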
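
For anyone who wants to try this without the original file, here is a minimal self-contained sketch of chunked training with a fixed classes array. The in-memory CSV and the feature columns f1 and f2 are made up for illustration, and the label column is assumed to fit in memory. The loss is left at its default because the name of the logistic loss differs between scikit-learn versions ('log' in older releases, 'log_loss' in newer ones).

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for train.csv; the columns "f1" and "f2" are made up.
csv_text = "f1,f2,clicked\n" + "\n".join(
    f"{i % 7},{i % 3},{int(i % 7 > 3)}" for i in range(60)
)
features_columns = ["f1", "f2"]

# Pass 1: read only the label column to collect every class label.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["clicked"])["clicked"]
classes = np.unique(labels)

clf = SGDClassifier(random_state=0)

# Pass 2: stream the data in chunks and update the model incrementally.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    X = chunk[features_columns]
    y = chunk["clicked"]
    # Chunks that happen to contain a single label are fine,
    # because the full label set is supplied explicitly.
    clf.partial_fit(X, y, classes=classes)

print(clf.classes_)
```

Reading the file twice costs an extra pass, but it avoids relying on the first chunk happening to contain every label. If the labels are known in advance, the first pass can be replaced by a hard-coded array.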
Alaleh Rz answered Sep 28 '22 06:09