 

How many features can scikit-learn handle?

I have a CSV file of size [66k, 56k] (rows, columns). It's a sparse matrix. I know that NumPy can handle a matrix of that size. I would like to know, based on everyone's experience, how many features scikit-learn algorithms can handle comfortably?
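For scale, a dense float64 array of that shape would need about 66,000 × 56,000 × 8 bytes ≈ 30 GB, which is why the sparse representation matters. Here is a minimal sketch of loading such a file into a SciPy CSR matrix chunk by chunk (the filename data.csv, chunk size, and float32 dtype are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Read the dense-on-disk CSV in chunks and convert each chunk to CSR
# immediately, so the full dense array never has to fit in memory.
chunks = []
for chunk in pd.read_csv("data.csv", header=None, chunksize=5000):
    chunks.append(sparse.csr_matrix(chunk.to_numpy(dtype=np.float32)))

X = sparse.vstack(chunks, format="csr")  # final 66k x 56k sparse matrix
print(X.shape, X.nnz)                    # nnz = number of stored non-zeros
```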

asked May 01 '13 by viper

2 Answers

It depends on the estimator. At that size, linear models still perform well, while SVMs will probably take forever to train (and forget about random forests, since they won't handle sparse matrices).

I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See @amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.

Full disclosure: I'm a scikit-learn core developer.
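A minimal sketch of the kind of run described above, using synthetic data from scipy.sparse.random (the shape, density, and random labels are assumptions, sized like the question's matrix rather than the 300k × 3.3 million run mentioned):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# Hypothetical sparse data matching the question's shape
# (density 1e-4 gives roughly 370k stored non-zeros).
X = sparse.random(66000, 56000, density=1e-4, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=X.shape[0])

# All three estimators accept CSR matrices directly, no densifying needed.
for clf in (LinearSVC(), LogisticRegression(max_iter=200), SGDClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, "train accuracy:", clf.score(X, y))
```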

answered Sep 30 '22 by Fred Foo

A linear model (regression, SGD, Naive Bayes) will probably be your best bet if you need to train your model frequently.

Before you go running any models, though, you could try the following:

1) Feature reduction. Are there features in your data that could easily be removed? For example, if your data is text- or ratings-based, there are many well-known options available.

2) Learning curve analysis. Maybe you only need a small subset of your data to train the model; past that point you may only be fitting noise or gaining tiny increases in accuracy.

Both approaches could allow you to greatly reduce the training data required.
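A minimal sketch of both ideas on synthetic data (SelectKBest with chi2 and the k=1000 cutoff are just one possible choice of feature-reduction method; all shapes and names here are illustrative assumptions, not from the answer):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

# Hypothetical sparse, non-negative data (chi2 requires non-negative features).
X = sparse.random(5000, 20000, density=1e-3, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=X.shape[0])

# 1) Feature reduction: keep the 1000 features most associated with the labels.
X_small = SelectKBest(chi2, k=1000).fit_transform(X, y)

# 2) Learning curve: see how validation accuracy grows with training-set size.
sizes, train_scores, val_scores = learning_curve(
    SGDClassifier(), X_small, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
print(sizes)
print(val_scores.mean(axis=1))  # if this plateaus early, less data suffices
```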

answered Sep 30 '22 by Steve