While solving a machine learning problem with scikit-learn (Python), I need to scale a scipy.sparse matrix before training with an SVM in order to achieve higher accuracy. But it is clearly mentioned here that:
scale and StandardScaler accept scipy.sparse matrices as input only when with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised as silently centering would break the sparsity and would often crash the execution by allocating excessive amounts of memory unintentionally.
This means that I cannot get zero mean this way. So how do I scale this sparse matrix so that it has zero mean along with unit variance? I also need to store this 'scaling' so that I can apply the same transformation to the test matrix as well.
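A minimal sketch of the behavior the quoted documentation describes (the error message text may vary between scikit-learn versions):

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X = csr_matrix([[0.0, 1.0], [2.0, 0.0]])

# The default with_mean=True raises on sparse input, as the docs describe,
# because centering would destroy the sparsity.
try:
    StandardScaler().fit(X)
except ValueError as e:
    print("ValueError:", e)

# Passing with_mean=False works: unit-variance scaling only, no centering.
StandardScaler(with_mean=False).fit(X)
```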
Sklearn has many algorithms that accept sparse matrices. The way to know is by checking the signature of the fit method in the documentation: look for X : {array-like, sparse matrix}.
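For example, a minimal sketch with LinearSVC, whose fit is documented as accepting X : {array-like, sparse matrix} (the tiny matrix here is made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

# A tiny sparse feature matrix: 4 samples, 3 features.
X = csr_matrix(np.array([[0.0, 1.0, 0.0],
                         [2.0, 0.0, 0.0],
                         [0.0, 0.0, 3.0],
                         [1.0, 1.0, 0.0]]))
y = np.array([0, 1, 0, 1])

# The sparse matrix can be passed directly; no densification is needed.
clf = LinearSVC().fit(X, y)
print(clf.predict(X))
```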
The problem with representing these sparse matrices as dense matrices is that memory must be allocated for every 32-bit or even 64-bit zero value in the matrix. This is clearly a waste of memory, as those zero values carry no information.
StandardScaler standardizes features by removing the mean and scaling to unit variance, computing z = (x - u) / s, where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples (or one if with_std=False). It performs this scaling through the Transformer API (e.g. as part of a preprocessing Pipeline), so the statistics fitted on the training data can be reused on the test data.
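A minimal sketch of this workflow on sparse data (with_mean=False, so only unit-variance scaling), including persisting the fitted scaler with joblib so the same transformation can be applied to the test matrix later; the file name is arbitrary:

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler
import joblib  # installed alongside scikit-learn

X_train = csr_matrix([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
X_test = csr_matrix([[1.0, 0.0], [0.0, 2.0]])

# with_mean=False keeps the matrix sparse: only unit-variance scaling is done.
scaler = StandardScaler(with_mean=False).fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the s learned from training data

# Persist the fitted scaler so the identical transformation
# can be re-applied later to new data.
joblib.dump(scaler, "scaler.joblib")
reloaded = joblib.load("scaler.joblib")
```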
If the matrix is small, you can densify it with X.toarray(); if the matrix is large, this will probably blow up your RAM.
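A sketch of the densify-then-standardize route, which does give true zero mean (the example matrix is made up and small enough to densify safely):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X = csr_matrix([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])

# Densifying allows with_mean=True (the default), i.e. real mean-centering.
scaler = StandardScaler().fit(X.toarray())
X_scaled = scaler.transform(X.toarray())

# Each column now has (approximately) zero mean and unit variance.
print(X_scaled.mean(axis=0))  # ~ [0, 0]
print(X_scaled.std(axis=0))   # ~ [1, 1]
```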
As an alternative to mean-centering and scaling, you can try per-sample normalization with sklearn.preprocessing.Normalizer; this is appropriate for frequency features (e.g. in text classification).
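A minimal sketch of Normalizer on sparse input; unlike centering, per-row normalization preserves sparsity:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Normalizer

X = csr_matrix([[3.0, 4.0, 0.0], [0.0, 0.0, 5.0]])

# Normalizer rescales each *row* to unit norm (L2 by default) and
# accepts sparse input without breaking sparsity.
normalizer = Normalizer(norm="l2")
X_norm = normalizer.transform(X)  # Normalizer is stateless; fit is a no-op

print(X_norm.toarray())  # [[0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
```

Because Normalizer learns nothing from the data, there is nothing to store: applying it to the test matrix later gives the same per-row transformation automatically.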