I'm attempting to use sklearn 0.11's LogisticRegression object to fit a model on 200,000 observations with about 80,000 features. The goal is to classify short text descriptions into 1 of 800 classes.
When I attempt to fit the classifier, pythonw.exe gives me:
Application Error: "The instruction at ... referenced memory at 0x00000000. The memory could not be written."
The features are extremely sparse, about 10 per observation, and are binary (either 1 or 0), so by my back-of-the-envelope calculation my 4 GB of RAM should be able to handle the memory requirements, but that doesn't appear to be the case. The models only fit when I use fewer observations and/or fewer features.
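Roughly, the sparse matrix itself should be tiny (a back-of-the-envelope sketch; the exact dtypes depend on what CountVectorizer actually produces):

n_obs, nnz_per_obs = 200000, 10
nnz = n_obs * nnz_per_obs                  # ~2 million stored values
# CSR storage: ~8 bytes of data + ~4 bytes of column index per nonzero, plus indptr
approx_mb = nnz * (8 + 4) / (1024.0 ** 2)
print(approx_mb)                           # roughly 23 MB for the input matrix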
If anything, I would like to use even more observations and features. My naive understanding is that the liblinear library running things behind the scenes is capable of supporting that. Any ideas for how I might squeeze a few more observations in?
My code looks like this:
y_vectorizer = LabelVectorizer(y) # my custom vectorizer for labels
y = y_vectorizer.fit_transform(y)
x_vectorizer = CountVectorizer(binary = True, analyzer = features)
x = x_vectorizer.fit_transform(x)
clf = LogisticRegression()
clf.fit(x, y)
The features() function I pass to analyzer just returns a list of strings indicating the features detected in each observation.
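For reference, features() is just a callable passed as the analyzer; a minimal, hypothetical version (the real function does more domain-specific feature detection) could look like:

def features(description):
    # hypothetical analyzer: turn one short text description into the list of
    # feature strings that CountVectorizer should count
    return description.lower().split()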
I'm using Python 2.7, sklearn 0.11, Windows XP with 4 GB of RAM.
liblinear (Library for Large Linear Classification) uses a coordinate descent algorithm. Coordinate descent minimizes a multivariate function by repeatedly solving univariate optimization problems in a loop; in other words, it moves toward the minimum along one coordinate direction at a time.
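As a toy illustration of coordinate descent (not liblinear's actual solver), alternating exact univariate minimizations on a simple quadratic converges to its minimum:

# minimize f(x, y) = x^2 + y^2 + x*y - 4x - 3y by coordinate descent:
# with one variable held fixed, each univariate subproblem has a closed-form minimizer
x, y = 0.0, 0.0
for _ in range(20):
    x = (4.0 - y) / 2.0   # argmin over x with y fixed (from df/dx = 2x + y - 4 = 0)
    y = (3.0 - x) / 2.0   # argmin over y with x fixed (from df/dy = 2y + x - 3 = 0)
print(x, y)               # converges to the true minimum (5/3, 2/3)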
Logistic regression computes the probability of an event occurring. It is a linear model for a categorical target variable: it models the log of the odds as a linear function of the features, and the predicted probability of a binary event is obtained by passing that linear score through the logistic (sigmoid) function.
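Concretely, the linear score is the log-odds, and the logistic function maps it to a probability; a minimal sketch with made-up weights:

import math

def predict_proba_one(weights, bias, features_vec):
    # log-odds = w . x + b; the logistic (sigmoid) function maps it into (0, 1)
    score = sum(w * f for w, f in zip(weights, features_vec)) + bias
    return 1.0 / (1.0 + math.exp(-score))

print(predict_proba_one([1.5, -0.8], 0.2, [1, 0]))  # probability of the positive class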
C is known as a "hyperparameter." The parameters (the coefficients) are the numbers the model learns from the features during fitting, whereas hyperparameters control how those parameters are learned. Regularization penalizes extreme parameter values, which would otherwise let extreme values in the training data cause overfitting; in sklearn's LogisticRegression, smaller C means stronger regularization.
linear_model is a module of sklearn that contains different estimators for performing machine learning with linear models. The term linear model means the prediction is specified as a linear combination of the features.
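A small sketch tying the last two points together (toy data, not the question's dataset): lowering C in sklearn.linear_model.LogisticRegression strengthens the regularization and shrinks the learned coefficients:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy binary target

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    # smaller C -> stronger penalty -> smaller total coefficient magnitude
    print(C, np.abs(clf.coef_).sum())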
liblinear (the backing implementation of sklearn.linear_model.LogisticRegression) will host its own copy of the data, because it is a C++ library whose internal memory layout cannot be directly mapped onto a pre-allocated sparse matrix in scipy such as scipy.sparse.csr_matrix or scipy.sparse.csc_matrix.
In your case I would recommend loading your data as a scipy.sparse.csr_matrix and feeding it to a sklearn.linear_model.SGDClassifier (with loss='log' if you want a logistic regression model and the ability to call the predict_proba method). SGDClassifier will not copy the input data if it is already using the scipy.sparse.csr_matrix memory layout.
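A minimal sketch of that swap, assuming x is already a scipy.sparse.csr_matrix and y is a one-dimensional array of the 800 class labels (not a vectorized label matrix):

from sklearn.linear_model import SGDClassifier

# loss='log' trains a logistic regression model by stochastic gradient descent;
# n_iter is the 0.11-era name for the number of passes over the data
clf = SGDClassifier(loss='log', penalty='l2', n_iter=5)
clf.fit(x, y)             # no copy is made when x is already in CSR layout
predicted = clf.predict(x)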
Expect it to allocate a dense model of 800 * (80000 + 1) * 8 / (1024 ** 2) ≈ 488 MB in memory (in addition to the size of your input dataset).
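Spelling that estimate out (one 8-byte float per class/feature pair, plus one intercept per class):

n_classes, n_features = 800, 80000
model_mb = n_classes * (n_features + 1) * 8 / (1024.0 ** 2)
print(model_mb)   # ~488 MB of coefficients, before counting the input data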
Edit: how to optimize the memory access for your dataset
To free memory after dataset extraction you can:
x_vectorizer = CountVectorizer(binary = True, analyzer = features)
x = x_vectorizer.fit_transform(x)
from sklearn.externals import joblib
joblib.dump(x.tocsr(), 'dataset.joblib')  # serialize the CSR matrix to disk
Then quit this python process (to force complete memory deallocation) and in a new process:
from sklearn.externals import joblib  # re-import in the new process
x_csr = joblib.load('dataset.joblib')
Under Linux / OS X you could memory-map that even more efficiently with:
x_csr = joblib.load('dataset.joblib', mmap_mode='c')  # 'c' = copy-on-write memory map
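From there, fitting proceeds on the memory-mapped matrix as in the sketch above (again assuming y is a one-dimensional array of class labels available in this process):

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log', n_iter=5)
clf.fit(x_csr, y)   # x_csr is the memory-mapped CSR matrix loaded above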