How do I use scikit-learn to train a model on a large csv data (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the competition webpage, and a link to my code is included; the error I get is described below. KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv function. To bypass this problem temporarily, I have to restart the kernel, which then read_csv function successfully loads the file, but the same error occurs when I run the same cell again.
When read_csv does load the file successfully and I have made my changes to the dataframe, I can pass the features and labels to KNeighborsClassifier's fit() function. At that point a similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that the predictive model is overwritten on every chunk, so only the last chunk ends up being used.
What do you think I can do to successfully train my model without running into memory problems?
Scikit-learn is steadily evolving, with new models, efficiency improvements in speed and memory usage, and growing large-data capabilities. Although scikit-learn is optimized for smaller, in-memory data, it does offer a decent set of algorithms for out-of-core classification, regression, clustering and decomposition.
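As a minimal sketch of the out-of-core approach: read the CSV in chunks and update an incremental estimator with partial_fit. The file name train.csv, the chunk size and the 'label' column layout are assumptions based on the Kaggle digit-recognizer data, and SGDClassifier is swapped in here because KNeighborsClassifier does not support incremental fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Sketch: train incrementally on CSV chunks instead of loading everything at once.
# Assumes the Kaggle digit-recognizer layout: a 'label' column plus 784 pixel columns.
clf = SGDClassifier()                 # supports partial_fit; KNeighborsClassifier does not
classes = np.arange(10)               # all digit classes must be declared up front

for chunk in pd.read_csv("train.csv", chunksize=5000):
    y = chunk["label"].values
    X = chunk.drop(columns="label").values.astype(np.float32)
    # partial_fit updates the same model rather than overwriting it per chunk
    clf.partial_fit(X, y, classes=classes)
```

Unlike refitting per chunk, every call to partial_fit refines the same model, which addresses the "model overwritten for each chunk" problem described in the question.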
Pandas doesn't have built-in multiprocessing support and can be slow with bigger datasets, but there are tools that put those CPU cores to work. Pandas is one of the best tools for exploratory data analysis, yet that doesn't make it the best tool for every task, such as big-data processing.
Both frameworks can be used with scikit-learn: you can load 22 GB of data into Dask or SFrame and then use it with sklearn.
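A hedged sketch of that workflow with Dask (the file name and column layout are assumptions): read the CSV lazily, then stream each partition into an incremental scikit-learn estimator.

```python
import numpy as np
import dask.dataframe as dd
from sklearn.linear_model import SGDClassifier

# Sketch: Dask reads the CSV lazily in partitions; each partition is a small
# pandas DataFrame that fits in memory. File name and 'label' column are assumed.
ddf = dd.read_csv("train.csv")

clf = SGDClassifier()
for delayed_part in ddf.to_delayed():      # one delayed pandas DataFrame per partition
    part = delayed_part.compute()
    y = part["label"].values
    X = part.drop(columns="label").values.astype(np.float32)
    clf.partial_fit(X, y, classes=np.arange(10))
```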
In practice the upper limit for a pandas DataFrame on that machine was bounded by the roughly 100 GB of free disk space available: when your Mac needs memory, it pushes data that isn't currently being used into a swap file on disk for temporary storage, and performance degrades badly once swapping starts.
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two different columns can have distinct datatypes (e.g. integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model, it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
To avoid this you could write or reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
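A minimal sketch with numpy.loadtxt, assuming the Kaggle train.csv layout (a header row, the label in the first column and pixel values in the rest):

```python
import numpy as np

# Sketch: parse the CSV straight into a single float32 array so scikit-learn
# does not need to make an extra float64 copy. Assumes the Kaggle train.csv
# layout: header row, 'label' first, then 784 comma-separated pixel columns.
data = np.loadtxt("train.csv", dtype=np.float32, delimiter=",", skiprows=1)
y = data[:, 0]      # first column is the digit label
X = data[:, 1:]     # remaining columns are pixel intensities
```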
Also, if your data is very sparse (many zero values) it is better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know which ones can). However, the CSV format itself is not well suited for sparse data, and I am not sure a direct CSV-to-scipy.sparse parser exists.
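One workaround, sketched below under the same file-layout assumptions as above, is to read the CSV in chunks and convert each dense chunk to a scipy.sparse.csr_matrix before stacking:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Sketch: build a CSR matrix from CSV chunks so zero pixels take no memory.
# File name and 'label' column layout are assumptions.
chunks = []
labels = []
for chunk in pd.read_csv("train.csv", chunksize=5000):
    labels.append(chunk["label"].values)
    dense = chunk.drop(columns="label").values.astype(np.float32)
    chunks.append(sparse.csr_matrix(dense))   # zero entries are dropped here

X = sparse.vstack(chunks)      # one sparse matrix for the whole dataset
y = np.concatenate(labels)
```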
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
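Until that is improved, a simple mitigation is to predict in small batches so the temporary distance array has shape (batch_size, n_samples_train) instead. The helper below is a hypothetical sketch; the batch size is chosen arbitrarily.

```python
import numpy as np

# Sketch: predict in small batches so the temporary distance matrix is
# (batch_size, n_samples_train) rather than (n_samples_predict, n_samples_train).
# `clf` is a fitted KNeighborsClassifier and `X_test` holds the test features.
def predict_in_batches(clf, X_test, batch_size=1000):
    preds = [clf.predict(X_test[i:i + batch_size])
             for i in range(0, len(X_test), batch_size)]
    return np.concatenate(preds)
```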