I have a very big dataset that cannot be loaded into memory.
I want to use this dataset as the training set for a scikit-learn classifier, for example a LogisticRegression.
Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches?
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches, which are used to calculate the model error and update the model coefficients. Implementations may sum or average the gradient over the mini-batch, which further reduces the variance of the gradient estimate.
Batch gradient descent uses all of the data to compute the gradient in a single iteration; mini-batch gradient descent uses only a subset of the data per iteration, so the mini-batch size is greater than one and smaller than the size of the complete training set. Batching gives you the efficiency of not holding all of the training data in memory, at the cost of one extra hyperparameter, the mini-batch size, and error information must be accumulated across mini-batches just as in batch gradient descent.
Small batch sizes give a learning process that converges quickly at the cost of noise in the gradient estimates; large batch sizes converge more slowly but with more accurate estimates of the error gradient. A good default batch size is often 32.
(scikit-learn also ships mini-batch variants of some estimators, for example MiniBatchKMeans for clustering; see the User Guide.)
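As a concrete illustration, here is a minimal NumPy sketch of mini-batch gradient descent for logistic regression. The function and parameter names are invented for this example, and the data is assumed to fit in memory, since the point is only to show how the batching and the update step interact.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_logreg(X, y, batch_size=32, lr=0.1, epochs=10):
    # Plain mini-batch gradient descent on the logistic loss.
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        # Shuffle once per epoch, then walk through the data batch by batch.
        idx = np.random.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient of the logistic loss, averaged over the mini-batch.
            error = sigmoid(Xb @ w + b) - yb
            w -= lr * (Xb.T @ error) / len(batch)
            b -= lr * error.mean()
    return w, b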
I believe that some of the classifiers in sklearn have a partial_fit method. This method allows you to pass mini-batches of data to the classifier, such that a gradient descent step is performed for each mini-batch. You would simply load a mini-batch from disk, pass it to partial_fit, release the mini-batch from memory, and repeat.
If you are particularly interested in doing this for logistic regression, then you'll want to use SGDClassifier, which fits a logistic regression model when loss = 'log' (spelled loss = 'log_loss' in recent scikit-learn releases).
You simply pass the features and labels for your mini-batch to partial_fit in the same way that you would use fit:
clf.partial_fit(X_minibatch, y_minibatch)
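A fuller loop might look like the sketch below. The .npy file names and the load_minibatches helper are hypothetical placeholders for however you store your batches on disk; the one real requirement is that the first call to partial_fit receives the complete list of classes, since a single mini-batch may not contain all of them.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical helper: yields (X, y) mini-batches stored as .npy files on disk.
def load_minibatches(n_batches):
    for i in range(n_batches):
        yield np.load(f"batch_{i}_X.npy"), np.load(f"batch_{i}_y.npy")

# 'log_loss' is the logistic loss in recent scikit-learn releases; older versions use 'log'.
clf = SGDClassifier(loss="log_loss")

classes = np.array([0, 1])  # every class the model will ever see, declared up front

for X_minibatch, y_minibatch in load_minibatches(n_batches=100):
    clf.partial_fit(X_minibatch, y_minibatch, classes=classes)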
Update: I recently came across the dask-ml library, which makes this task very easy by combining dask arrays with partial_fit. There is an example in the dask-ml documentation.
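Roughly, and based on my reading of dask-ml's Incremental wrapper, the pattern looks like the sketch below; the array shapes and chunk sizes are arbitrary stand-ins for data that would normally be built lazily from files on disk.

import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# Stand-in chunked arrays; in practice these would be loaded lazily from disk.
X = da.random.random((1_000_000, 20), chunks=(10_000, 20))
y = (da.random.random((1_000_000,), chunks=(10_000,)) > 0.5).astype(int)

# Incremental wraps any estimator that implements partial_fit and calls it once per chunk.
clf = Incremental(SGDClassifier(loss="log_loss"))
clf.fit(X, y, classes=[0, 1])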
Have a look at the scaling strategies included in the sklearn documentation: http://scikit-learn.org/stable/modules/scaling_strategies.html
A good example is provided here: http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
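The core pattern in those scaling strategies is to stream the data and call partial_fit once per chunk. As one possible sketch, assuming a CSV file and a 'label' column that are purely hypothetical, you could stream the file with pandas:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # assumed label set

# Read the (hypothetical) big CSV in chunks that fit in memory.
for chunk in pd.read_csv("big_dataset.csv", chunksize=10_000):
    y = chunk["label"].to_numpy()                 # assumed label column name
    X = chunk.drop(columns=["label"]).to_numpy()
    clf.partial_fit(X, y, classes=classes)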