I am loading a CSV file via numpy.loadtxt into a numpy array. My data has about 1 million records and 87 columns. While object.nbytes is only 177159666 bytes, it actually takes much more memory, because I get a 'MemoryError' while training a Decision Tree using scikit-learn. Also, after reading the data, the available memory in my system drops by 1.8 GB. I am working on a Linux machine with 3 GB of memory. So does object.nbytes return the real memory usage of a numpy array?
train = np.loadtxt('~/Py_train.csv', delimiter=',', skiprows=1, dtype='float16')
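For context, nbytes only counts the array's own data buffer, and for the shape described above the arithmetic roughly matches the quoted figure; the extra usage during loading is usually temporary parsing overhead rather than the array itself. A minimal sketch of that check (the exact row count is approximated from the description above, and the comment about loadtxt's intermediate objects is an assumption about its implementation, not a measured fact):

import numpy as np

rows, cols = 1000000, 87                    # approximate shape from the question
itemsize = np.dtype('float16').itemsize     # 2 bytes per element

print(rows * cols * itemsize)               # ~174 MB, close to the reported .nbytes
# The additional memory seen while loading likely comes from the temporary
# Python strings/lists the text parser builds before the final array exists.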
I had a similar problem when trying to create a large 400,000 x 100,000 matrix. Fitting all of that data into a dense ndarray is impossible.
However, the big insight was that most of the values in the matrix are empty, so it can be represented as a sparse matrix. Sparse matrices are useful because they represent the data using much less memory. I used scipy.sparse's sparse matrix implementation, and I'm able to fit this large matrix in memory.
Here is my implementation:
https://github.com/paolodm/Kaggle/blob/master/mdschallenge/buildmatrix.py
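As an illustration only (not the linked implementation), a minimal sketch of the scipy.sparse approach, assuming a mostly-empty matrix of the size mentioned above; lil_matrix is used for incremental construction and then converted to CSR:

import numpy as np
from scipy.sparse import lil_matrix

# Build a mostly-empty 400,000 x 100,000 matrix without a dense allocation.
m = lil_matrix((400000, 100000), dtype=np.float32)
m[0, 12345] = 1.0          # only nonzero entries are stored
m[250000, 7] = 3.5

csr = m.tocsr()            # CSR format: efficient row slicing and products
print(csr.nnz, csr.data.nbytes)   # memory scales with the number of nonzeros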
You will probably get better performance by using numpy.fromiter:
In [30]: numpy.fromiter((tuple(row) for row in csv.reader(open('/tmp/data.csv'))), dtype='i4,i4,i4')
Out[30]:
array([(1, 2, 3), (4, 5, 6)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
where
$ cat /tmp/data.csv
1,2,3
4,5,6
Alternatively, I strongly suggest you use pandas: it's based on numpy and has many utility functions for statistical analysis.
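A minimal sketch of what that could look like for a CSV such as the one in the question (the file name and the float32 choice are assumptions; pandas' C parser is typically far more memory-friendly than numpy.loadtxt):

import numpy as np
import pandas as pd

# Parse the file with pandas and cast directly to a compact dtype.
df = pd.read_csv('Py_train.csv', dtype=np.float32)

train = df.values          # underlying numpy array, if one is still needed
print(train.shape, train.nbytes)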