While solving a machine learning problem with scikit-learn (Python), I need to scale a scipy.sparse matrix before training with an SVM in order to achieve higher accuracy. But it is clearly mentioned here that:
scale and StandardScaler accept scipy.sparse matrices as input only when with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised as silently centering would break the sparsity and would often crash the execution by allocating excessive amounts of memory unintentionally.
This means that I cannot get zero mean this way. So how do I scale this sparse matrix so that it has zero mean along with unit variance? I also need to store this 'scaling' so that I can apply the same transformation to the test matrix as well.
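A minimal sketch of the behavior the quoted documentation describes (the error message text may vary between scikit-learn versions):

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X = csr_matrix([[0.0, 1.0], [2.0, 0.0]])

# The default with_mean=True raises on sparse input, as the docs describe,
# because centering would destroy the sparsity.
try:
    StandardScaler().fit(X)
except ValueError as e:
    print("ValueError:", e)

# Passing with_mean=False works: unit-variance scaling only, no centering.
StandardScaler(with_mean=False).fit(X)
```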
Sklearn has many algorithms that accept sparse matrices. The way to know is by checking the signature of the fit method in the documentation: look for X : {array-like, sparse matrix}.
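For example, a minimal sketch with LinearSVC, whose fit is documented as accepting X : {array-like, sparse matrix} (the tiny matrix here is made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

# A tiny sparse feature matrix: 4 samples, 3 features.
X = csr_matrix(np.array([[0.0, 1.0, 0.0],
                         [2.0, 0.0, 0.0],
                         [0.0, 0.0, 3.0],
                         [1.0, 1.0, 0.0]]))
y = np.array([0, 1, 0, 1])

# The sparse matrix can be passed directly; no densification is needed.
clf = LinearSVC().fit(X, y)
print(clf.predict(X))
```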
The problem with representing these sparse matrices as dense matrices is that memory must be allocated for every 32-bit or even 64-bit zero value in the matrix. This is clearly a waste of memory, as those zero values carry no information.
StandardScaler standardizes features by removing the mean and scaling to unit variance, computing z = (x - u) / s, where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples (or one if with_std=False). It performs this scaling through the Transformer API (e.g. as part of a preprocessing Pipeline), so the statistics fitted on the training data can be reused on the test data.
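A minimal sketch of this workflow on sparse data (with_mean=False, so only unit-variance scaling), including persisting the fitted scaler with joblib so the same transformation can be applied to the test matrix later; the file name is arbitrary:

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler
import joblib  # installed alongside scikit-learn

X_train = csr_matrix([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
X_test = csr_matrix([[1.0, 0.0], [0.0, 2.0]])

# with_mean=False keeps the matrix sparse: only unit-variance scaling is done.
scaler = StandardScaler(with_mean=False).fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the s learned from training data

# Persist the fitted scaler so the identical transformation
# can be re-applied later to new data.
joblib.dump(scaler, "scaler.joblib")
reloaded = joblib.load("scaler.joblib")
```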
If the matrix is small, you can densify it with X.toarray(); if the matrix is large, this will probably blow up your RAM.
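A sketch of the densify-then-standardize route, which does give true zero mean (the example matrix is made up and small enough to densify safely):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X = csr_matrix([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])

# Densifying allows with_mean=True (the default), i.e. real mean-centering.
scaler = StandardScaler().fit(X.toarray())
X_scaled = scaler.transform(X.toarray())

# Each column now has (approximately) zero mean and unit variance.
print(X_scaled.mean(axis=0))  # ~ [0, 0]
print(X_scaled.std(axis=0))   # ~ [1, 1]
```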
As an alternative to mean-centering and scaling, you can try per-sample normalization with sklearn.preprocessing.Normalizer; this is appropriate for frequency features (e.g. in text classification).
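A minimal sketch of Normalizer on sparse input; unlike centering, per-row normalization preserves sparsity:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Normalizer

X = csr_matrix([[3.0, 4.0, 0.0], [0.0, 0.0, 5.0]])

# Normalizer rescales each *row* to unit norm (L2 by default) and
# accepts sparse input without breaking sparsity.
normalizer = Normalizer(norm="l2")
X_norm = normalizer.transform(X)  # Normalizer is stateless; fit is a no-op

print(X_norm.toarray())  # [[0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
```

Because Normalizer learns nothing from the data, there is nothing to store: applying it to the test matrix later gives the same per-row transformation automatically.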