
Python MemoryError when doing fitting with Scikit-learn

I am running Python 2.7 (64-bit) on a Windows 8 64-bit system with 24 GB of memory. Fitting the usual sklearn.linear_model.Ridge runs fine.

Problem: When I use sklearn.linear_model.RidgeCV(alphas=alphas) instead, I run into the MemoryError shown below on the line rr.fit(X_train, y_train), which performs the fit.

How can I prevent this error?

Code snippet

from sklearn.linear_model import RidgeCV


def fit(X_train, y_train):
    alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]

    rr = RidgeCV(alphas=alphas)
    rr.fit(X_train, y_train)

    return rr


rr = fit(X_train, y_train)

Error

MemoryError                               Traceback (most recent call last)
<ipython-input-41-a433716e7179> in <module>()
      1 # Fit Training set
----> 2 rr = fit(X_train, y_train)

<ipython-input-35-9650bd58e76c> in fit(X_train, y_train)
      3 
      4     rr = RidgeCV(alphas=alphas)
----> 5     rr.fit(X_train, y_train)
      6 
      7     return rr

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    696                                   gcv_mode=self.gcv_mode,
    697                                   store_cv_values=self.store_cv_values)
--> 698             estimator.fit(X, y, sample_weight=sample_weight)
    699             self.alpha_ = estimator.alpha_
    700             if self.store_cv_values:

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    608             raise ValueError('bad gcv_mode "%s"' % gcv_mode)
    609 
--> 610         v, Q, QT_y = _pre_compute(X, y)
    611         n_y = 1 if len(y.shape) == 1 else y.shape[1]
    612         cv_values = np.zeros((n_samples * n_y, len(self.alphas)))

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute_svd(self, X, y)
    531     def _pre_compute_svd(self, X, y):
    532         if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533             X = X.toarray()
    534         U, s, _ = np.linalg.svd(X, full_matrices=0)
    535         v = s ** 2

C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
    559     def toarray(self, order=None, out=None):
    560         """See the docstring for `spmatrix.toarray`."""
--> 561         return self.tocoo(copy=False).toarray(order=order, out=out)
    562 
    563     ##############################################################

C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
    236     def toarray(self, order=None, out=None):
    237         """See the docstring for `spmatrix.toarray`."""
--> 238         B = self._process_toarray_args(order, out)
    239         fortran = int(B.flags.f_contiguous)
    240         if not fortran and not B.flags.c_contiguous:

C:\Python27\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
    633             return out
    634         else:
--> 635             return np.zeros(self.shape, dtype=self.dtype, order=order)
    636 
    637 

MemoryError: 

Code

print type(X_train)
print X_train.shape

Result

<class 'scipy.sparse.csr.csr_matrix'>
(183576, 101507)
Nyxynyx asked May 02 '13 06:05


2 Answers

Take a look at this part of your stack trace:

    531     def _pre_compute_svd(self, X, y):
    532         if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533             X = X.toarray()
    534         U, s, _ = np.linalg.svd(X, full_matrices=0)
    535         v = s ** 2

The algorithm you're using relies on numpy's linear algebra routines to do SVD. But those can't handle sparse matrices, so the author simply converts them to regular non-sparse arrays. The first thing that has to happen for this is to allocate an all-zero array and then fill in the appropriate spots with the values sparsely stored in the sparse matrix. Sounds easy enough, but let's math. A float64 (the default dtype, which you're probably using if you don't know what you're using) element takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:

183576 * 101507 * 8 = 149,073,992,256 bytes ~= 150 gigabytes
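If you want to run this check yourself before calling toarray(), here is a minimal sketch of the same arithmetic. The helper name dense_nbytes is made up for this example, and the 8-byte item size assumes float64.

```python
def dense_nbytes(shape, itemsize=8):
    """Bytes needed for a dense array of this shape (itemsize=8 assumes float64)."""
    n_rows, n_cols = shape
    return n_rows * n_cols * itemsize


# Shape reported in the question
shape = (183576, 101507)
needed = dense_nbytes(shape)

# Report in both decimal gigabytes and binary gibibytes
print("dense copy needs %d bytes (%.1f GB, %.1f GiB)"
      % (needed, needed / 1e9, needed / 2.0 ** 30))
```

Comparing that number against your installed RAM tells you up front whether a dense conversion has any chance of succeeding.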

Your system's memory manager probably took one look at that allocation request and gave up: it is roughly six times your 24 GB of RAM, hence the MemoryError. But what can you do about it?

First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
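As one possible way to do that reduction without ever densifying the input, here is a rough sketch using scikit-learn's TruncatedSVD, which accepts sparse matrices. This is my suggestion, not something from the question: the random data, the demo shapes and n_components=20 are all placeholders, and whether a low-rank projection suits your features is something only you can judge.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import RidgeCV

# Small random stand-in for the real data (assumption: the real X_train is CSR too)
rng = np.random.RandomState(0)
X_demo = sp.random(200, 500, density=0.01, format="csr", random_state=rng)
y_demo = rng.randn(200)

# Project the sparse features onto a few components; the input stays sparse,
# only the (n_samples, n_components) output is dense
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X_demo)

# The reduced matrix is small enough for the dense code path in RidgeCV
rr = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1])
rr.fit(X_reduced, y_demo)
print(X_reduced.shape, rr.alpha_)
```

The dense output costs n_samples * n_components * 8 bytes, so at 183576 samples even a few hundred components stay well under a gigabyte.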

Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Using a sparse matrix representation to represent non-sparse data will only result in using more space than you would have originally, so this approach might not work. Proceed with caution.
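For reference, this is roughly what calling the sparse routine looks like on a small random matrix. Note the caveat: svds only computes the k largest singular triplets (k must be smaller than both dimensions), whereas the code in the traceback asks for the full thin SVD, so this is a sketch of the API and not a drop-in patch for ridge.py. The demo matrix and k=10 are arbitrary.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Small random sparse matrix standing in for X_train
rng = np.random.RandomState(0)
X_demo = sp.random(300, 400, density=0.02, format="csr", random_state=rng)

# Compute only the k largest singular triplets, without densifying X_demo
k = 10
U, s, Vt = svds(X_demo, k=k)

# The ordering of the returned values should not be relied on,
# so sort them in descending order explicitly
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

v = s ** 2  # same quantity as "v" in the traceback, but truncated to k values
print(U.shape, s.shape, Vt.shape)
```

Memory here scales with k * (n_samples + n_features) for the factors, which is the whole point, but any later step that needs all singular values would still hit the same wall.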

kwatford answered Nov 03 '22 05:11


The relevant option here is gcv_mode. It can take 3 values: "auto", "svd" and "eigen". By default, it is set to "auto", which has the following behavior: use the svd mode if n_samples > n_features, otherwise use the eigen mode.

Since in your case n_samples > n_features, the svd mode is chosen. However, the svd mode currently doesn't handle sparse data properly. scikit-learn should be fixed to use proper sparse SVD instead of the dense SVD.
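To make the rule concrete, here is the "auto" choice as described above, written out as a tiny function. The name auto_gcv_mode is made up for illustration and is not part of scikit-learn; the real selection logic may differ between versions.

```python
def auto_gcv_mode(n_samples, n_features):
    """Mirror of the 'auto' rule described above (hypothetical helper)."""
    return "svd" if n_samples > n_features else "eigen"


# Shape from the question: more samples than features
mode = auto_gcv_mode(183576, 101507)
print(mode)
```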

As a workaround, I would force the eigen mode by passing gcv_mode="eigen", since this mode should properly handle sparse data. However, n_samples is quite large in your case. Since the eigen mode builds a kernel matrix (and thus has n_samples ** 2 memory complexity), the kernel matrix may not fit in memory. In that case, I would just reduce the number of samples (the eigen mode can handle a very large number of features without problem, though).
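A rough sketch of that workaround, assuming a float64 kernel matrix: with all 183576 samples it would need 183576 ** 2 * 8 = 269,601,182,208 bytes (about 270 GB), so some subsampling is needed. The helper names, the random demo data and the 1 MB budget below are placeholders chosen for the example; pick a budget that matches your machine.

```python
import math

import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import RidgeCV


def kernel_nbytes(n_samples, itemsize=8):
    """Bytes for an n_samples x n_samples kernel matrix (float64 assumed)."""
    return n_samples * n_samples * itemsize


def max_samples_for_budget(budget_bytes, itemsize=8):
    """Largest n_samples whose kernel matrix fits in budget_bytes."""
    return int(math.sqrt(budget_bytes / float(itemsize)))


# Small random stand-in for the real data: fewer samples than features
rng = np.random.RandomState(0)
X_demo = sp.random(500, 2000, density=0.01, format="csr", random_state=rng)
y_demo = rng.randn(500)

# Keep only as many rows as the (toy) 1 MB kernel budget allows
n_keep = min(X_demo.shape[0], max_samples_for_budget(1000 * 1000))
idx = rng.choice(X_demo.shape[0], size=n_keep, replace=False)
X_sub, y_sub = X_demo[idx], y_demo[idx]

# Force the eigen mode so the fit works from the kernel matrix
alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]
rr = RidgeCV(alphas=alphas, gcv_mode="eigen")
rr.fit(X_sub, y_sub)
print(n_keep, rr.alpha_)
```

Random row subsampling is only the simplest option; if your samples are not exchangeable (for example, time-ordered), choose the subset accordingly.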

In any case, since both n_samples and n_features are quite large, you are pushing this implementation to its limits (even with a proper sparse SVD).

Also see https://github.com/scikit-learn/scikit-learn/issues/1921

Mathieu answered Nov 03 '22 05:11