I have a 3 GB CSV file that I'm trying to read with Python; I need the column-wise median.
    from numpy import *

    def data():
        return genfromtxt('All.csv', delimiter=',')

    data = data()  # This is where it fails already.
    med = zeros(len(data[0]))
    data = data.T
    for i in xrange(len(data)):
        m = median(data[i])
        med[i] = 1.0 / float(m)
    print med
The error that I get is this:
    Python(1545) malloc: *** mmap(size=16777216) failed (error code=12)
    *** error: can't allocate region
    *** set a breakpoint in malloc_error_break to debug
    Traceback (most recent call last):
      File "Normalize.py", line 40, in <module>
        data = data()
      File "Normalize.py", line 39, in data
        return genfromtxt('All.csv',delimiter=',')
      File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 1495, in genfromtxt
        for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
    MemoryError
I think it's just an out-of-memory error. I am running 64-bit Mac OS X with 4 GB of RAM, and both numpy and Python are compiled in 64-bit mode.
How do I fix this? Should I try a distributed approach, just for the memory management?
Thanks
EDIT: I also tried this, but no luck:

    genfromtxt('All.csv', delimiter=',', dtype=float16)
One way to process a large file is to read it in chunks of reasonable size: each chunk is read into memory and processed before the next one is read. With pandas, read_csv takes a chunksize parameter that specifies the chunk size as a number of rows, as sketched below.
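A minimal sketch of that idea, assuming All.csv has no header and (purely for illustration) 20 columns; exact medians cannot be merged across chunks, so each pass collects a single column so that only that column is ever held in memory at once:

    import numpy as np
    import pandas as pd

    ncols = 20           # assumed column count, for illustration only
    chunksize = 10 ** 6  # rows read per chunk

    medians = np.empty(ncols)
    for col in range(ncols):
        # One pass per column: usecols limits parsing to a single column,
        # chunksize keeps each individual read bounded in memory.
        parts = [chunk.iloc[:, 0].values
                 for chunk in pd.read_csv('All.csv', header=None,
                                          usecols=[col], chunksize=chunksize)]
        medians[col] = np.median(np.concatenate(parts))
    print(medians)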
As other folks have mentioned, for a really large file, you're better off iterating.
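A rough sketch of what iterating looks like: a statistic that can be updated row by row (a column-wise mean here) never needs more than one line in memory at a time, although an exact median would still need all of a column's values at once:

    import numpy as np

    totals = None
    count = 0
    with open('All.csv') as f:
        for line in f:
            # Parse one row at a time; only the running totals are kept.
            row = np.array(line.rstrip().split(','), dtype=float)
            totals = row if totals is None else totals + row
            count += 1
    print(totals / count)  # column-wise means, not medians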
However, you do commonly want the entire thing in memory for various reasons.
genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist). If your data is very regular (e.g. just simple delimited rows, all of the same type), you can also improve on either by using numpy.fromiter.
If you have enough RAM, consider using np.loadtxt('yourfile.txt', delimiter=','). (You may also need to specify skiprows if you have a header on the file.)
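For instance, tying this back to the question's goal (assuming All.csv is plain comma-delimited floats with no header):

    import numpy as np

    data = np.loadtxt('All.csv', delimiter=',')  # add skiprows=1 if there is a header row
    med = np.median(data, axis=0)                # one median per column
    print(med)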
As a quick comparison, loading a ~500 MB text file with loadtxt uses ~900 MB of RAM at peak usage, while loading the same file with genfromtxt uses ~2.5 GB.
(Memory-usage plots: Loadtxt, Genfromtxt)
Alternately, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking. If your data is very simple and regular, you can improve on them greatly.)
    import numpy as np

    def generate_text_file(length=1000000, ncols=20):
        # Write a large CSV of random floats to test with.
        data = np.random.random((length, ncols))
        np.savetxt('large_text_file.csv', data, delimiter=',')

    def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
        def iter_func():
            with open(filename, 'r') as infile:
                for _ in range(skiprows):
                    next(infile)
                for line in infile:
                    line = line.rstrip().split(delimiter)
                    for item in line:
                        yield dtype(item)
            iter_loadtxt.rowlength = len(line)

        # fromiter builds a flat 1D array straight from the generator,
        # then the array is reshaped using the row length recorded above.
        data = np.fromiter(iter_func(), dtype=dtype)
        data = data.reshape((-1, iter_loadtxt.rowlength))
        return data

    #generate_text_file()
    data = iter_loadtxt('large_text_file.csv')
(Memory-usage plot: Fromiter)
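Applying the same function to the question's file (again assuming plain comma-delimited floats with no header) to get the column-wise medians:

    data = iter_loadtxt('All.csv')
    med = np.median(data, axis=0)  # column-wise medians
    print(med)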