SKlearn SGD Partial Fit

Tags:

python

scikit-learn

What am I doing wrong here? I have a large data set that I want to fit incrementally using scikit-learn's SGDClassifier and its partial_fit method.

I do the following:

from sklearn.linear_model import SGDClassifier
import pandas as pd

chunksize = 5
clf2 = SGDClassifier(loss='log', penalty="l2")

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    clf2.partial_fit(X, Y)

I'm getting this error:

Traceback (most recent call last):
  File "/predict.py", line 48, in <module>
    sys.exit(0 if main() else 1)
  File "/predict.py", line 44, in main
    predict()
  File "/predict.py", line 38, in predict
    clf2.partial_fit(X, Y)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 512, in partial_fit
    coef_init=None, intercept_init=None)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 349, in _partial_fit
    _check_partial_fit_first_call(self, classes)
  File "/Users/anaconda/lib/python3.5/site-packages/sklearn/utils/multiclass.py", line 297, in _check_partial_fit_first_call
    raise ValueError("classes must be passed on the first call "
ValueError: classes must be passed on the first call to partial_fit.
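The error can probably be reproduced without the CSV file at all. The following is a minimal sketch, not the asker's script: the tiny in-memory arrays are invented stand-ins for one chunk of data, and the default loss is used so the snippet does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Invented stand-in for one chunk of the CSV (assumption: 2 features, binary label).
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 1, 1, 0])

clf = SGDClassifier()

try:
    # First call to partial_fit without `classes`: this is what triggers the error.
    clf.partial_fit(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The point of the sketch is only that the exception comes from the very first partial_fit call, before any training happens, so the chunk size and the contents of the file are not the cause.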


asked Feb 09 '17 21:02

1 Answer

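The classifier does not know the set of class labels in advance, so the first call to partial_fit must receive them through the classes argument, for example np.unique(target), where target is the class column. Because you are reading the data in chunks, the labels have to come from data that contains every possible value: if you take them from the first chunk, make sure that chunk includes every class. Do not recompute np.unique(Y) for each chunk, because partial_fit raises a ValueError when a later call receives a different set of classes. Your code would then be:

import numpy as np
classes = None
for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    if classes is None:
        classes = np.unique(Y)  # first chunk must contain every label
    clf2.partial_fit(X, Y, classes=classes)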
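If the full label set is known before training, it may be simpler to state it up front instead of relying on the first chunk. The sketch below assumes a binary "clicked" label with values 0 and 1; the in-memory CSV, the column names f1 and f2, and the variable names are all hypothetical stand-ins for train.csv and features_columns. The default loss is used to keep the snippet independent of the scikit-learn version.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for train.csv: two feature columns and a binary "clicked" label.
# Only every tenth row is a click, so some 5-row chunks contain a single class.
csv_text = "f1,f2,clicked\n" + "\n".join(
    f"{i % 3},{i % 5},{int(i % 10 == 0)}" for i in range(30)
)

feature_columns = ["f1", "f2"]   # assumed column names
all_classes = np.array([0, 1])   # assumed to be known before training

clf = SGDClassifier(penalty="l2", random_state=0)

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    X = chunk[feature_columns]
    y = chunk["clicked"]
    # `classes` is required on the first call; passing the same set again is accepted.
    clf.partial_fit(X, y, classes=all_classes)

print(clf.classes_)
```

With the labels fixed in advance, a chunk that happens to contain only one class does not cause a problem, which matters with a chunk size as small as 5.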

answered Sep 28 '22 06:09

Alaleh Rz

Kabard
