How many features can scikit-learn handle?

Question

I have a CSV file of size [66k, 56k] (rows, columns). It's a sparse matrix. I know that NumPy can handle a matrix of that size. Based on everyone's experience, how many features can scikit-learn algorithms handle comfortably?

Fred Foo · Accepted Answer

It depends on the estimator. At that size, linear models still perform well, while kernel SVMs will probably take forever to train. Forget about random forests too: when this answer was written they did not handle sparse matrices (later scikit-learn releases do accept sparse input).

I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See @amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.
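As a rough illustration of the point above, here is a minimal sketch that fits the three linear estimators named in this answer on a SciPy CSR matrix. The data is synthetic and much smaller than the matrix in the question; the shapes, density, and labels are assumptions made up for the example, not the asker's data. The first lines also show why keeping the matrix sparse matters: stored densely as float64, a 66k × 56k matrix would need about 29.6 GB.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# Dense float64 footprint of the matrix from the question, in GB.
dense_gb = 66_000 * 56_000 * 8 / 1e9
print(f"dense size: {dense_gb:.1f} GB")

rng = np.random.RandomState(0)

# Hypothetical stand-in data: 2,000 x 5,000 CSR matrix, 0.1% non-zero entries.
X = sp.random(2000, 5000, density=0.001, format="csr", random_state=rng)

# Synthetic labels from a hidden weight vector, so there is a signal to learn.
w = rng.randn(5000)
y = (X @ w > 0).astype(int)

models = {
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGDClassifier": SGDClassifier(random_state=0),
}

# All three estimators accept the sparse matrix directly; no densifying needed.
scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # training accuracy, only a sanity check
    print(name, round(scores[name], 3))
```

The scores printed here are training accuracy on made-up data, so they say nothing about how the models would do on the real matrix; the point is only that the sparse input is passed through as-is.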

Full disclosure: I'm a scikit-learn core developer.

Steve · Answer

A linear model (regression, SGD, Bayes) will probably be your best bet if you need to train your model frequently.

Before you run any models, though, you could try the following:

1) Feature reduction. Are there features in your data that could easily be removed? For example, if your data is text or ratings based, there are lots of well-known options available.

2) Learning curve analysis. Maybe you only need a small subset of your data to train a model, and beyond that point you are only overfitting to your data or gaining tiny increases in accuracy.

Both approaches could allow you to greatly reduce the training data required.
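Both ideas can be sketched with scikit-learn's own tools. The example below is only one possible reading of the advice: it uses `SelectKBest` with the `chi2` score for feature reduction (an assumption; the answer does not name a method, and `chi2` requires non-negative features) and `learning_curve` to see how the score changes with training set size. The data, the choice of `k=200`, and the `SGDClassifier` are all hypothetical choices for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Hypothetical sparse data: 2,000 samples, 2,000 non-negative features.
X = sp.random(2000, 2000, density=0.02, format="csr", random_state=rng)

# Only the first 100 features carry signal in this synthetic setup.
w = np.zeros(2000)
w[:100] = rng.randn(100)
y = (X @ w > 0).astype(int)

# 1) Feature reduction: keep the 200 features with the highest chi2 score.
X_small = SelectKBest(chi2, k=200).fit_transform(X, y)
print("reduced shape:", X_small.shape)

# 2) Learning curve: score as a function of training set size.
# The selector sits inside the pipeline so it is refit on each training fold
# and the held-out fold does not leak into feature selection.
pipeline = make_pipeline(SelectKBest(chi2, k=200), SGDClassifier(random_state=0))
sizes, train_scores, test_scores = learning_curve(
    pipeline, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=3
)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(n, round(score, 3))
```

If the mean validation score stops improving well before the largest training size, that suggests a subset of the rows may be enough for that estimator.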

Tags: python, machine-learning, numpy, scipy, scikit-learn, viper
