
sklearn linear regression for large data

Does sklearn.linear_model.LinearRegression support online/incremental learning?

I have 100 groups of data and want to fit a single regression on all of them together. Each group has over 10,000 instances and about 10 features, so building one huge matrix (10^6 by 10) leads to a memory error in sklearn. It would be nice if I could update the regressor with a batch of samples from each new group in turn.

I found this post relevant, but the accepted solution handles online learning with a single new data point (one instance at a time) rather than a batch of samples.

ChuNan asked Mar 26 '14 17:03



2 Answers

Take a look at linear_model.SGDRegressor; it learns a linear model using stochastic gradient descent.

In general, sklearn has many models that implement partial_fit; they are all pretty useful on medium to large datasets that don't fit in RAM.
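
As a rough sketch of what that looks like for the setup in the question (100 groups, about 10,000 rows and 10 features each), the loop below calls partial_fit once per group so that only one group is in memory at a time. The load_group function and the synthetic data it generates are hypothetical stand-ins for however the real groups are read from disk; SGDRegressor is used with its default settings.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical ground truth, used only to generate synthetic data for this sketch.
rng = np.random.default_rng(0)
true_coef = rng.normal(size=10)


def load_group(i, n_samples=10_000, n_features=10):
    """Hypothetical loader: stands in for reading group i from disk."""
    g = np.random.default_rng(i)
    X = g.normal(size=(n_samples, n_features))
    y = X @ true_coef + 0.5 + g.normal(scale=0.1, size=n_samples)
    return X, y


model = SGDRegressor(random_state=0)

# One partial_fit call per group: each call makes a single pass over
# that group's rows and updates the same coefficients in place.
for i in range(100):
    X, y = load_group(i)
    model.partial_fit(X, y)

print(model.coef_, model.intercept_)
```

Each partial_fit call is a single pass over the batch it receives, so if one pass over all groups is not enough for real data, the outer loop can simply be repeated.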

Yanshuai Cao answered Sep 19 '22 21:09


Not all algorithms can learn incrementally, that is, without seeing all of the instances at once. That said, all estimators implementing the partial_fit API are candidates for mini-batch learning, also known as "online learning".

Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online, so the memory footprint and convergence rate are not affected by the batch size.
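
One practical caveat when feeding SGDRegressor in batches: SGD is sensitive to feature scaling, and with out-of-core data the scaler has to be fitted incrementally too. The sketch below assumes a two-pass approach, first accumulating statistics with StandardScaler.partial_fit and then training on the scaled batches. The make_group and iter_groups helpers, the feature scales, and the coefficients are all hypothetical and exist only to generate synthetic data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

N_GROUPS = 20
# Hypothetical: features on very different scales, with made-up coefficients.
SCALES = np.array([1.0, 10.0, 100.0, 0.1, 1.0])
COEF = np.array([2.0, -0.3, 0.05, 4.0, 1.0])


def make_group(seed, n_samples=2_000):
    """Hypothetical stand-in for reading one group from disk."""
    g = np.random.default_rng(seed)
    X = g.normal(size=(n_samples, 5)) * SCALES
    y = X @ COEF + 3.0 + g.normal(scale=0.1, size=n_samples)
    return X, y


def iter_groups():
    """Yield one group at a time so only one is in memory."""
    for i in range(N_GROUPS):
        yield make_group(i)


# Pass 1: accumulate per-feature mean and variance batch by batch.
scaler = StandardScaler()
for X, _ in iter_groups():
    scaler.partial_fit(X)

# Pass 2: train on the scaled batches.
model = SGDRegressor(random_state=0)
for X, y in iter_groups():
    model.partial_fit(scaler.transform(X), y)

# Evaluate on a group the model has not seen.
X_new, y_new = make_group(999)
r2 = model.score(scaler.transform(X_new), y_new)
print(r2)
```

The extra pass costs one more read of the data but keeps memory usage at one group at a time.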

Drewness answered Sep 18 '22 21:09
