Scaling issues with scipy.sparse matrix while using scikit

While solving a machine learning problem with scikit-learn (Python), I need to scale a scipy.sparse matrix before training an SVM in order to achieve higher accuracy. But it's clearly mentioned here that:

scale and StandardScaler accept scipy.sparse matrices as input only when with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised as silently centering would break the sparsity and would often crash the execution by allocating excessive amounts of memory unintentionally.

This means that I cannot get zero mean this way. So how do I scale this sparse matrix so that it has zero mean along with unit variance? I also need to store this 'scaling' so that I can apply the same transformation to the test matrix.
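For reference, the sparse-safe case described in the quote can be sketched as follows. This is only a minimal illustration on made-up random data (the shapes, densities and variable names are assumptions, not part of my real problem): it scales to unit variance without centering, and reuses the fitted scaler on the test matrix.

```python
import pickle

import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the real train/test matrices.
X_train = sparse.random(100, 20, density=0.1, format="csr", random_state=0)
X_test = sparse.random(30, 20, density=0.1, format="csr", random_state=1)

# with_mean=False keeps the data sparse: only the variance is normalized.
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)

# The fitted scaler stores the per-feature scale, so the same
# transformation can be applied to the test matrix.
X_test_scaled = scaler.transform(X_test)

# The fitted object can also be serialized and restored later.
restored = pickle.loads(pickle.dumps(scaler))
X_test_scaled_again = restored.transform(X_test)
```

The result stays sparse and has unit variance per feature, but the column means are not zero, which is exactly the limitation I am asking about.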

VineetChirania asked Nov 27 '13 10:11


People also ask

Does Sklearn work with sparse matrices?

Sklearn has many algorithms that accept sparse matrices. The way to know is by checking the description of the fit method in the documentation. Look for this: X: {array-like, sparse matrix}.

What is the issue with sparse matrices?

The problem with representing these sparse matrices as dense matrices is that memory is required and must be allocated for each 32-bit or even 64-bit zero value in the matrix. This is clearly a waste of memory resources as those zero values do not contain any information.

What does Sklearn preprocessing scale do?

Performs scaling to unit variance using the Transformer API (e.g. as part of a preprocessing Pipeline ).

What does Standard scaler Sklearn preprocessing StandardScaler do?

Standardize features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as z = (x - u) / s, where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.


1 Answer

If the matrix is small, you can densify it with X.toarray(). If the matrix is large, then this will probably blow your RAM.
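A minimal sketch of the densify route, assuming small made-up matrices (the shapes and names here are illustrative only). It estimates the dense memory footprint first, then standardizes with the default `StandardScaler`, which does center the data:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical small matrices; real data may be far too large for this.
X_train = sparse.random(50, 10, density=0.2, format="csr", random_state=0)
X_test = sparse.random(20, 10, density=0.2, format="csr", random_state=1)

# Rough size of the dense copy in bytes: rows * columns * bytes per value.
dense_bytes = X_train.shape[0] * X_train.shape[1] * X_train.dtype.itemsize

# Densify, then fit on the training data only.
scaler = StandardScaler()  # with_mean=True by default
X_train_std = scaler.fit_transform(X_train.toarray())

# Reuse the stored mean and scale on the test data.
X_test_std = scaler.transform(X_test.toarray())
```

Checking `dense_bytes` against the available memory before calling `toarray()` is a cheap way to tell whether this route is feasible.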

As an alternative to mean-centering and scaling, you can try per-sample normalization with sklearn.preprocessing.Normalizer; this is appropriate for frequency features (e.g. in text classification).
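A short sketch of the `Normalizer` alternative, using a tiny hand-made count matrix as a stand-in for real frequency features (the values are assumptions for illustration). Each row is rescaled to unit L2 norm and the output stays sparse:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import Normalizer

# Hypothetical term-count matrix: 3 samples, 4 features.
X_train = sparse.csr_matrix(
    np.array([[3.0, 0.0, 4.0, 0.0],
              [0.0, 0.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
)

# Normalizer works per sample (row), so it learns nothing from the
# training data and can be applied to the test matrix the same way.
normalizer = Normalizer(norm="l2")
X_train_norm = normalizer.fit_transform(X_train)

# L2 norm of every row after normalization.
row_norms = np.sqrt(
    np.asarray(X_train_norm.multiply(X_train_norm).sum(axis=1))
).ravel()
```

Because the transformer is stateless, there is no 'scaling' to store here; calling `normalizer.transform` on the test matrix is enough.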

Fred Foo answered Nov 15 '22 09:11