 

Scikit and Pandas: Fitting Large Data

How do I use scikit-learn to train a model on a large CSV file (~75MB) without running into memory problems?

I'm using IPython Notebook as the programming environment, with the pandas and sklearn packages, to analyze data from Kaggle's digit recognizer tutorial.

The data is available on the webpage, and links to my code and the error message are included.

KNeighborsClassifier is used for the prediction.

Problem:

A MemoryError occurs when loading the large dataset with the read_csv function. To work around this temporarily, I have to restart the kernel; read_csv then loads the file successfully, but the same error occurs when I run the same cell again.

When read_csv does load the file successfully and I have made changes to the DataFrame, I can pass the features and labels to KNeighborsClassifier's fit() function. At that point, a similar MemoryError occurs.

I tried the following:

Iterating through the CSV file in chunks and fitting the data chunk by chunk, but the problem is that the predictive model is overwritten on every chunk instead of being updated incrementally (a rough sketch of this attempt is shown below).
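A minimal sketch of what that chunked attempt looks like; the file name, chunk size, and the 'label' column name are assumptions based on the Kaggle digit recognizer data:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

# Read the ~75MB CSV in chunks to keep memory usage low.
for chunk in pd.read_csv('train.csv', chunksize=10000):
    y = chunk['label']                 # assumed label column
    X = chunk.drop('label', axis=1)    # remaining pixel columns as features
    # fit() retrains from scratch, so each call discards the previous chunks --
    # this is the "model is overwritten" problem described above.
    model.fit(X, y)
```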

What do you think I can do to successfully train my model without running into memory problems?

asked Jul 29 '12 by Ji Park

1 Answer

Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two different columns can have distinct datatypes (e.g. integers, dates, strings).

When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
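To make the two-copies point concrete, here is a minimal sketch (the file name and the 'label' column are assumptions) that inspects the per-column dtypes and estimates the size of the float64 copy a model would trigger:

```python
import numpy as np
import pandas as pd

# Assumed path to the Kaggle digit recognizer training data (~75MB CSV).
df = pd.read_csv('train.csv')

print(df.dtypes)  # one homogeneous dtype per column (e.g. int64 pixel columns)

# scikit-learn converts the DataFrame to a homogeneous float array before fitting,
# so this second array lives in memory alongside the original DataFrame.
X = np.asarray(df.drop('label', axis=1), dtype=np.float64)
print(X.nbytes / 1e6, "MB for the float64 copy alone")
```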

To avoid this you could write or reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
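For instance, a minimal numpy.loadtxt sketch, assuming the first row is a header and the first column is the label as in the Kaggle file:

```python
import numpy as np

# Parse the CSV directly into a float32 array, skipping the header row.
data = np.loadtxt('train.csv', dtype=np.float32, delimiter=',', skiprows=1)

y = data[:, 0]    # assumed label column
X = data[:, 1:]   # pixel features, already in a dtype scikit-learn can use,
                  # so fit() should not need a second full-size copy
```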

Also, if your data is very sparse (many zero values) it will be better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited for sparse data and I am not sure there exists a direct CSV-to-scipy.sparse parser.
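If the data really is mostly zeros (digit images contain many zero pixels), one option is to convert the dense array to a scipy.sparse matrix. The sketch below uses SGDClassifier purely as an example of an estimator known to accept sparse input; it is not part of the original answer:

```python
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier  # example of a sparse-aware estimator

# X, y as loaded above; csr_matrix stores only the non-zero values.
X_sparse = sp.csr_matrix(X)
print(X_sparse.data.nbytes / 1e6, "MB of non-zero values")

clf = SGDClassifier()
clf.fit(X_sparse, y)   # many (not all) scikit-learn estimators accept CSR input
```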

Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:

https://github.com/scikit-learn/scikit-learn/issues/325
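One hedged workaround, not from the original answer, is to predict in small batches so only a (batch_size, n_samples_train) block of distances is allocated at a time:

```python
import numpy as np

def predict_in_batches(model, X_test, batch_size=1000):
    """Hypothetical helper: predict in slices so the temporary
    (batch_size, n_samples_train) distance matrix stays small."""
    preds = [model.predict(X_test[i:i + batch_size])
             for i in range(0, len(X_test), batch_size)]
    return np.concatenate(preds)
```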

answered Sep 22 '22 by ogrisel