Mini-batch training of a scikit-learn classifier where I provide the mini-batches

I have a very big dataset that cannot be loaded into memory.

I want to use this dataset as the training set for a scikit-learn classifier - for example, a LogisticRegression.

Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches?

asked Oct 25 '17 by Ulderique Demoitre

People also ask

What is mini batch training?

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients. Implementations may choose to sum the gradient over the mini-batch which further reduces the variance of the gradient.
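
To make the idea concrete, here is a minimal NumPy sketch of one epoch of mini-batch gradient descent for logistic regression; the data, learning rate, and batch size below are made up for the example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: 1000 samples, 5 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(X.shape[1])
lr, batch_size = 0.1, 32

# One epoch: each update uses only batch_size rows to estimate the gradient
for start in range(0, len(X), batch_size):
    Xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]
    grad = Xb.T @ (sigmoid(Xb @ w) - yb) / len(Xb)  # gradient averaged over the mini-batch
    w -= lr * grad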

What is mini batch and batch?

Batch means that you use all your data to compute the gradient during one iteration. Mini-batch means you only take a subset of all your data during one iteration.

Why do we use mini batches?

In mini-batch gradient descent, we use a subset of the dataset to take each step in the learning process. The mini-batch size can therefore be greater than one but less than the size of the complete training set.

What is the difference between batch and mini-batch in machine learning?

Batching provides both the efficiency of not having to hold all the training data in memory and the efficiency of vectorized algorithm implementations. Mini-batch training requires configuring an additional “mini-batch size” hyperparameter for the learning algorithm, and error information must be accumulated across mini-batches of training examples, as in batch gradient descent.

What is Mini-Batch K-Means clustering?

Mini-Batch K-Means clustering (MiniBatchKMeans in scikit-learn) fits K-Means on successive small batches of the data. Its main parameters are the number of clusters to form (which is also the number of centroids to generate) and the initialization method: ‘k-means++’ selects initial cluster centers for k-means clustering in a smart way to speed up convergence. Read more in the User Guide.
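
As a quick sketch of how this looks in scikit-learn (the data below is synthetic and the parameter values are arbitrary):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 2))  # illustrative data

# Fit on mini-batches of 100 samples instead of the full array at once
mbk = MiniBatchKMeans(n_clusters=8, init='k-means++', batch_size=100, random_state=0)
mbk.fit(X)
labels = mbk.predict(X)

# If the data does not fit in memory, MiniBatchKMeans also supports
# feeding chunks one at a time via mbk.partial_fit(X_chunk)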

What is a good batch size for machine learning?

Small values give a learning process that converges quickly at the cost of noise in the training process. Large values give a learning process that converges slowly with accurate estimates of the error gradient. Tip 1: A good default for batch size might be 32.


2 Answers

I believe that some of the classifiers in sklearn have a partial_fit method. This method allows you to pass minibatches of data to the classifier, such that a gradient descent step is performed for each minibatch. You would simply load a minibatch from disk, pass it to partial_fit, release the minibatch from memory, and repeat.

If you are particularly interested in doing this for Logistic Regression, then you'll want to use SGDClassifier, which can be set to use logistic regression when loss = 'log'.

You simply pass the features and labels for your minibatch to partial_fit in the same way that you would use fit:

clf.partial_fit(X_minibatch, y_minibatch)
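
To make the full loop concrete, here is a sketch assuming a hypothetical load_minibatch(i) helper that reads one chunk from disk, a known number of chunks n_batches, and binary 0/1 labels; note that partial_fit needs the complete set of class labels on its first call:

import numpy as np
from sklearn.linear_model import SGDClassifier

# loss='log' gives logistic regression (newer scikit-learn versions name it 'log_loss')
clf = SGDClassifier(loss='log')
classes = np.array([0, 1])  # all possible labels, required on the first call

for i in range(n_batches):
    X_minibatch, y_minibatch = load_minibatch(i)  # hypothetical: read one chunk from disk
    if i == 0:
        clf.partial_fit(X_minibatch, y_minibatch, classes=classes)
    else:
        clf.partial_fit(X_minibatch, y_minibatch)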

Update:

I recently came across the dask-ml library, which makes this task very easy by combining dask arrays with partial_fit. There is an example on the linked webpage.
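
For reference, a rough sketch of what that can look like, assuming dask_ml.wrappers.Incremental and synthetic dask arrays (check the dask-ml documentation for the current API):

import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# Illustrative lazy arrays; in practice these would be built from files on disk
X = da.random.random((1000000, 20), chunks=(10000, 20))
y = (X[:, 0] > 0.5).astype(int)

# Incremental feeds each chunk of the dask array to the estimator's partial_fit
clf = Incremental(SGDClassifier(loss='log'))
clf.fit(X, y, classes=[0, 1])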

answered Oct 22 '22 by Angus Williams


Have a look at the scaling strategies included in the sklearn documentation: http://scikit-learn.org/stable/modules/scaling_strategies.html

A good example is provided here: http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
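
In the same spirit as that out-of-core example, here is a minimal sketch that streams a CSV in chunks with pandas and feeds each chunk to partial_fit; the file name, chunk size, and 'label' column are assumptions about the data layout:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log')
classes = np.array([0, 1])  # full set of labels, needed on the first partial_fit call

# Stream the file so the full dataset is never held in memory at once
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=10000)):
    y = chunk.pop('label').to_numpy()  # 'label' column assumed to hold the target
    X = chunk.to_numpy()
    clf.partial_fit(X, y, classes=classes if i == 0 else None)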

answered Oct 22 '22 by elson serrao