 

Numpy array taking too much memory

Tags:

python

numpy

I am loading a CSV file via numpy.loadtxt into a numpy array. My data has about 1 million records and 87 columns. While object.nbytes is only 177159666 bytes, it actually takes much more memory: I get a 'MemoryError' while training a decision tree with scikit-learn. Also, after reading the data, the available memory on my system dropped by 1.8 GB. I am working on a Linux machine with 3 GB of memory. So does object.nbytes return the real memory usage of a numpy array?

import numpy as np

train = np.loadtxt('~/Py_train.csv', delimiter=',', skiprows=1, dtype='float16')
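For reference, nbytes only counts the array's own data buffer, i.e. the number of elements times the bytes per element. A rough check of that arithmetic (a minimal sketch, assuming roughly 1 million rows and 87 float16 columns as above):

import numpy as np

# nbytes is just size * itemsize for the final array buffer.
a = np.zeros((1_000_000, 87), dtype='float16')
print(a.itemsize)           # 2 bytes per float16 value
print(a.size * a.itemsize)  # 174000000
print(a.nbytes)             # 174000000, close to the 177159666 reported above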
asked Feb 20 '23 by ibictts


2 Answers

I had a similar problem when trying to create a large 400,000 x 100,000 matrix. Fitting all of that data into a dense ndarray was impossible.

However, the key insight was that most of the values in the matrix are empty, so the data can be represented as a sparse matrix. Sparse matrices are useful because they store only the non-empty entries and therefore use far less memory. I used scipy.sparse's sparse matrix implementation, and I was able to fit this large matrix in memory.

Here is my implementation:

https://github.com/paolodm/Kaggle/blob/master/mdschallenge/buildmatrix.py
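As a rough illustration of the same idea (a minimal sketch, not the code from the link above; the shape and values are just placeholders):

import numpy as np
from scipy import sparse

# Build the mostly-empty matrix incrementally; lil_matrix is cheap to
# assign into, then convert to CSR, which stores only the non-zero
# entries plus their index arrays.
m = sparse.lil_matrix((400_000, 100_000), dtype=np.float32)
m[0, 5] = 1.0
m[123, 456] = 2.5

csr = m.tocsr()
print(csr.nnz)  # 2 stored values instead of 40 billion cells
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)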

answered Feb 23 '23 by Paolo del Mundo


You will probably get better performance using numpy.fromiter:

In [30]: numpy.fromiter((tuple(row) for row in csv.reader(open('/tmp/data.csv'))), dtype='i4,i4,i4')
Out[30]: 
array([(1, 2, 3), (4, 5, 6)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

where

$ cat /tmp/data.csv 
1,2,3
4,5,6

Alternatively, I strongly suggest you use pandas: it is built on top of numpy and has many utility functions for statistical analysis.
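For example, a minimal sketch (the file name mirrors the question, and the float32 dtype is just an assumption that all columns are numeric):

import pandas as pd

# read_csv parses with a fast C engine, handles the header row itself
# (no skiprows needed), and lets you fix the dtype up front; info()
# then reports the actual in-memory footprint.
df = pd.read_csv('Py_train.csv', dtype='float32')
df.info(memory_usage='deep')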

answered Feb 23 '23 by lbolla