I'm attempting to use sklearn 0.11's LogisticRegression object to fit a model on 200,000 observations with about 80,000 features. The goal is to classify short text descriptions into 1 of 800 classes.
When I attempt to fit the classifier, pythonw.exe gives me:
Application Error: "The instruction at ... referenced memory at 0x00000000. The memory could not be written."
The features are extremely sparse, about 10 per observation, and are binary (either 1 or 0), so by my back-of-the-envelope calculation my 4 GB of RAM should be able to handle the memory requirements, but that doesn't appear to be the case. The models only fit when I use fewer observations and/or fewer features.
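Roughly, the sparse matrix itself should be tiny (a back-of-the-envelope sketch; the exact dtypes depend on what CountVectorizer actually produces):

n_obs, nnz_per_obs = 200000, 10
nnz = n_obs * nnz_per_obs                  # ~2 million stored values
# CSR storage: ~8 bytes of data + ~4 bytes of column index per nonzero, plus indptr
approx_mb = nnz * (8 + 4) / (1024.0 ** 2)
print(approx_mb)                           # roughly 23 MB for the input matrix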
If anything, I would like to use even more observations and features. My naive understanding is that the liblinear library running things behind the scenes is capable of supporting that. Any ideas for how I might squeeze a few more observations in?
My code looks like this:
y_vectorizer = LabelVectorizer(y) # my custom vectorizer for labels
y = y_vectorizer.fit_transform(y)
x_vectorizer = CountVectorizer(binary = True, analyzer = features)
x = x_vectorizer.fit_transform(x)
clf = LogisticRegression()
clf.fit(x, y)
The features() function I pass to analyzer just returns a list of strings indicating the features detected in each observation.
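For reference, features() is just a callable passed as the analyzer; a minimal, hypothetical version (the real function does more domain-specific feature detection) could look like:

def features(description):
    # hypothetical analyzer: turn one short text description into the list of
    # feature strings that CountVectorizer should count
    return description.lower().split()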
I'm using Python 2.7, sklearn 0.11, Windows XP with 4 GB of RAM.
liblinear (Library for Large Linear Classification) uses a coordinate descent algorithm. Coordinate descent minimizes a multivariate function by repeatedly solving univariate optimization problems in a loop; in other words, it moves toward the minimum along one coordinate direction at a time.
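As a toy illustration of coordinate descent (not liblinear's actual solver), alternating exact univariate minimizations on a simple quadratic converges to its minimum:

# minimize f(x, y) = x^2 + y^2 + x*y - 4x - 3y by coordinate descent:
# with one variable held fixed, each univariate subproblem has a closed-form minimizer
x, y = 0.0, 0.0
for _ in range(20):
    x = (4.0 - y) / 2.0   # argmin over x with y fixed (from df/dx = 2x + y - 4 = 0)
    y = (3.0 - x) / 2.0   # argmin over y with x fixed (from df/dy = 2y + x - 3 = 0)
print(x, y)               # converges to the true minimum (5/3, 2/3)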
Logistic regression computes the probability of an event occurring. It is a linear model for a categorical target variable: it models the log of the odds as a linear function of the features, and the predicted probability of a binary event is obtained by passing that linear score through the logistic (sigmoid) function.
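Concretely, the linear score is the log-odds, and the logistic function maps it to a probability; a minimal sketch with made-up weights:

import math

def predict_proba_one(weights, bias, features_vec):
    # log-odds = w . x + b; the logistic (sigmoid) function maps it into (0, 1)
    score = sum(w * f for w, f in zip(weights, features_vec)) + bias
    return 1.0 / (1.0 + math.exp(-score))

print(predict_proba_one([1.5, -0.8], 0.2, [1, 0]))  # probability of the positive class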
C is known as a "hyperparameter." The parameters (the coefficients) are the numbers the model learns from the features during fitting, whereas hyperparameters control how those parameters are learned. Regularization penalizes extreme parameter values, which would otherwise let extreme values in the training data cause overfitting; in sklearn's LogisticRegression, smaller C means stronger regularization.
linear_model is a module of sklearn that contains different estimators for performing machine learning with linear models. The term linear model means the prediction is specified as a linear combination of the features.
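A small sketch tying the last two points together (toy data, not the question's dataset): lowering C in sklearn.linear_model.LogisticRegression strengthens the regularization and shrinks the learned coefficients:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy binary target

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    # smaller C -> stronger penalty -> smaller total coefficient magnitude
    print(C, np.abs(clf.coef_).sum())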
liblinear (the backing implementation of sklearn.linear_model.LogisticRegression) will host its own copy of the data, because it is a C++ library whose internal memory layout cannot be directly mapped onto a pre-allocated sparse matrix in scipy such as scipy.sparse.csr_matrix or scipy.sparse.csc_matrix.
In your case I would recommend loading your data as a scipy.sparse.csr_matrix and feeding it to a sklearn.linear_model.SGDClassifier (with loss='log' if you want a logistic regression model and the ability to call the predict_proba method). SGDClassifier will not copy the input data if it is already using the scipy.sparse.csr_matrix memory layout.
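A minimal sketch of that swap, assuming x is already a scipy.sparse.csr_matrix and y is a one-dimensional array of the 800 class labels (not a vectorized label matrix):

from sklearn.linear_model import SGDClassifier

# loss='log' trains a logistic regression model by stochastic gradient descent;
# n_iter is the 0.11-era name for the number of passes over the data
clf = SGDClassifier(loss='log', penalty='l2', n_iter=5)
clf.fit(x, y)             # no copy is made when x is already in CSR layout
predicted = clf.predict(x)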
Expect it to allocate a dense model of 800 * (80000 + 1) * 8 / (1024 ** 2) ≈ 488 MB in memory (in addition to the size of your input dataset).
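Spelling that estimate out (one 8-byte float per class/feature pair, plus one intercept per class):

n_classes, n_features = 800, 80000
model_mb = n_classes * (n_features + 1) * 8 / (1024.0 ** 2)
print(model_mb)   # ~488 MB of coefficients, before counting the input data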
Edit: how to optimize the memory access for your dataset
To free memory after dataset extraction you can:
x_vectorizer = CountVectorizer(binary = True, analyzer = features)
x = x_vectorizer.fit_transform(x)
from sklearn.externals import joblib
joblib.dump(x.tocsr(), 'dataset.joblib')  # serialize the CSR matrix to disk
Then quit this python process (to force complete memory deallocation) and in a new process:
from sklearn.externals import joblib  # re-import in the new process
x_csr = joblib.load('dataset.joblib')
Under Linux / OS X you could memory-map that even more efficiently with:
x_csr = joblib.load('dataset.joblib', mmap_mode='c')  # 'c' = copy-on-write memory map
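From there, fitting proceeds on the memory-mapped matrix as in the sketch above (again assuming y is a one-dimensional array of class labels available in this process):

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log', n_iter=5)
clf.fit(x_csr, y)   # x_csr is the memory-mapped CSR matrix loaded above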