I am running Python 2.7 (64-bit) on a Windows 8 64-bit system with 24 GB of memory. Fitting an ordinary sklearn.linear_model.Ridge works fine.
Problem: However, when I use sklearn.linear_model.RidgeCV(alphas=alphas) for the fitting, the line rr.fit(X_train, y_train) that executes the fitting procedure raises the MemoryError shown below.
How can I prevent this error?
Code snippet
from sklearn.linear_model import RidgeCV

def fit(X_train, y_train):
    alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]
    rr = RidgeCV(alphas=alphas)
    rr.fit(X_train, y_train)
    return rr

rr = fit(X_train, y_train)
Error
MemoryError Traceback (most recent call last)
<ipython-input-41-a433716e7179> in <module>()
1 # Fit Training set
----> 2 rr = fit(X_train, y_train)
<ipython-input-35-9650bd58e76c> in fit(X_train, y_train)
3
4 rr = RidgeCV(alphas=alphas)
----> 5 rr.fit(X_train, y_train)
6
7 return rr
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
696 gcv_mode=self.gcv_mode,
697 store_cv_values=self.store_cv_values)
--> 698 estimator.fit(X, y, sample_weight=sample_weight)
699 self.alpha_ = estimator.alpha_
700 if self.store_cv_values:
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
608 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
609
--> 610 v, Q, QT_y = _pre_compute(X, y)
611 n_y = 1 if len(y.shape) == 1 else y.shape[1]
612 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute_svd(self, X, y)
531 def _pre_compute_svd(self, X, y):
532 if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533 X = X.toarray()
534 U, s, _ = np.linalg.svd(X, full_matrices=0)
535 v = s ** 2
C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
559 def toarray(self, order=None, out=None):
560 """See the docstring for `spmatrix.toarray`."""
--> 561 return self.tocoo(copy=False).toarray(order=order, out=out)
562
563 ##############################################################
C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
236 def toarray(self, order=None, out=None):
237 """See the docstring for `spmatrix.toarray`."""
--> 238 B = self._process_toarray_args(order, out)
239 fortran = int(B.flags.f_contiguous)
240 if not fortran and not B.flags.c_contiguous:
C:\Python27\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
633 return out
634 else:
--> 635 return np.zeros(self.shape, dtype=self.dtype, order=order)
636
637
MemoryError:
Code
print type(X_train)
print X_train.shape
Result
<class 'scipy.sparse.csr.csr_matrix'>
(183576, 101507)
Take a look at this part of your stack trace:
531 def _pre_compute_svd(self, X, y):
532 if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533 X = X.toarray()
534 U, s, _ = np.linalg.svd(X, full_matrices=0)
535 v = s ** 2
The algorithm you're using relies on numpy's linear algebra routines to do the SVD. But those can't handle sparse matrices, so the author simply converts them to regular non-sparse arrays. The first thing that has to happen for this is to allocate an all-zero dense array and then fill in the appropriate spots with the values sparsely stored in the sparse matrix. Sounds easy enough, but let's do the math. A float64 element (the default dtype, which you're probably using if you don't know what you're using) takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:
183576 * 101507 * 8 = 149,073,992,256 ~= 150 gigabytes
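If you want to sanity-check that figure yourself (a trivial snippet; the shape comes from the print output above):

# Size of the dense array that toarray() would have to allocate.
n_samples, n_features = 183576, 101507
bytes_needed = n_samples * n_features * 8  # float64 = 8 bytes per element
print(bytes_needed / 1e9)  # ~149.07, i.e. about 150 GB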
Your system's memory manager probably took one look at that allocation request and committed suicide. But what can you do about it?
First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
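If you do go that route, one concrete option (my own sketch, not something from the question) is scikit-learn's TruncatedSVD, which accepts scipy.sparse input directly and never builds the dense array:

from sklearn.decomposition import TruncatedSVD

# Project the 101507 sparse features down to a few hundred dense
# components; TruncatedSVD works on scipy.sparse matrices as-is.
# n_components=200 is an arbitrary illustrative choice, not a tuned value.
svd = TruncatedSVD(n_components=200)
X_train_small = svd.fit_transform(X_train)  # dense (183576, 200) array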
Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Using a sparse matrix representation to represent non-sparse data will only result in using more space than you would have originally, so this approach might not work. Proceed with caution.
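For reference, here is roughly what that sparse routine looks like (a sketch only: svds computes just the k largest singular triplets, so it is not a drop-in replacement for the full SVD that _pre_compute_svd expects):

from scipy.sparse.linalg import svds

# Compute only the k largest singular triplets of the sparse matrix,
# unlike np.linalg.svd, which materializes the full dense decomposition.
# k must be smaller than min(X_train.shape); 100 is an illustrative value.
U, s, Vt = svds(X_train, k=100)
v = s ** 2  # the quantity computed on line 535 of the traceback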
The relevant option here is gcv_mode. It can take 3 values: "auto", "svd" and "eigen". By default, it is set to "auto", which has the following behavior: use the svd mode if n_samples > n_features, otherwise use the eigen mode.
Since in your case n_samples > n_features, the svd mode is chosen. However, the svd mode currently doesn't handle sparse data properly. scikit-learn should be fixed to use proper sparse SVD instead of the dense SVD.
As a workaround, I would force the eigen mode by passing gcv_mode="eigen", since that mode should handle sparse data properly. However, n_samples is quite large in your case. Since the eigen mode builds a kernel matrix (and thus has n_samples ** 2 memory complexity), the kernel matrix may not fit in memory. In that case, I would simply reduce the number of samples (the eigen mode can handle a very large number of features without problem, though).
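Concretely, the workaround plus the subsampling would look something like this (a sketch: the 20000-row sample size is an arbitrary example, chosen so that the resulting 20000 x 20000 float64 kernel matrix, about 3.2 GB, fits comfortably in your 24 GB):

import numpy as np
from sklearn.linear_model import RidgeCV

alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]

# At the full 183576 samples the kernel matrix alone would need
# 183576**2 * 8 bytes (~270 GB), even more than the ~150 GB dense X.
# Subsample rows first (assumes y_train is a numpy array).
idx = np.random.choice(X_train.shape[0], 20000, replace=False)

rr = RidgeCV(alphas=alphas, gcv_mode="eigen")  # force the eigen path
rr.fit(X_train[idx], y_train[idx])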
In any case, since both n_samples and n_features are quite large, you are pushing this implementation to its limits (even with a proper sparse SVD).
Also see https://github.com/scikit-learn/scikit-learn/issues/1921