
Is there any way to train a sklearn model on disk data such as HDF5?

I have a very large dataset that does not fit in memory. I would like to train my model on data stored on disk, for example in HDF5. Does sklearn support this, or is there an alternative?

erogol asked May 22 '15 13:05




1 Answer

What you are asking for is called out-of-core or streaming learning. It is only possible with the subset of scikit-learn models that implement the partial_fit method for incremental fitting.

There is an example in the documentation. There is no specific utility for fitting models on HDF5 data in particular, but you can adapt this example to fetch the data from any external data source (e.g. HDF5 data on the local disk, or a database over the network, for instance using the pandas SQL adapter).
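A minimal sketch of that adaptation might look like the following. It is not the documentation example: iter_chunks and fit_out_of_core are hypothetical helper names, the data is synthetic, and numpy.memmap stands in for the on-disk store so the snippet runs without extra dependencies. An h5py dataset can be sliced the same way, so it could in principle be passed in place of the memmap arrays.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_chunks(X, y, chunk_size):
    """Yield (X_chunk, y_chunk) pairs; only one chunk is loaded at a time."""
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        # Slicing a disk-backed array reads just this range into memory.
        yield np.asarray(X[start:stop]), np.asarray(y[start:stop])


def fit_out_of_core(model, X, y, classes, chunk_size=1000):
    """Train `model` with one partial_fit call per chunk (a single pass)."""
    for X_chunk, y_chunk in iter_chunks(X, y, chunk_size):
        # All labels must be declared up front, because a given chunk
        # is not guaranteed to contain every class.
        model.partial_fit(X_chunk, y_chunk, classes=classes)
    return model


# Synthetic, linearly separable data written to disk (illustration only).
rng = np.random.default_rng(0)
n_samples, n_features = 5000, 10
X_mem = rng.normal(size=(n_samples, n_features))
y_mem = (X_mem[:, 0] + X_mem[:, 1] > 0).astype("int64")

tmp_dir = tempfile.mkdtemp()
X_path = os.path.join(tmp_dir, "X.dat")
y_path = os.path.join(tmp_dir, "y.dat")
X_out = np.memmap(X_path, dtype="float64", mode="w+", shape=X_mem.shape)
y_out = np.memmap(y_path, dtype="int64", mode="w+", shape=y_mem.shape)
X_out[:] = X_mem
y_out[:] = y_mem
X_out.flush()
y_out.flush()

# Reopen read-only: the arrays now live on disk and are read lazily.
X_disk = np.memmap(X_path, dtype="float64", mode="r", shape=X_mem.shape)
y_disk = np.memmap(y_path, dtype="int64", mode="r", shape=y_mem.shape)

clf = fit_out_of_core(
    SGDClassifier(random_state=0), X_disk, y_disk, classes=np.array([0, 1])
)
accuracy = clf.score(np.asarray(X_disk[:1000]), np.asarray(y_disk[:1000]))
print(accuracy)
```

A single pass is often not enough for real data, so the loop in fit_out_of_core can be repeated for several epochs. Any preprocessing (scaling, vectorizing) would also have to work chunk by chunk.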

ogrisel answered Nov 03 '22 09:11