 

numpy loadtxt function seems to be consuming too much memory

When I load an array using numpy.loadtxt, it seems to take too much memory. E.g.

import numpy

a = numpy.zeros(int(1e6))

causes an increase of about 8MB in memory (observed with htop, and consistent with 8 bytes × 1 million ≈ 8MB). On the other hand, if I save and then load this array

numpy.savetxt('a.csv', a)
b = numpy.loadtxt('a.csv')

my memory usage increases by about 100MB! Again, I observed this with htop, both in the IPython shell and while stepping through code with Pdb++.

Any idea what's going on here?

After reading jozzas's answer, I realized that if I know the array size ahead of time, there is a much more memory-efficient way to read the file. For example, if 'a' were an m×n array:

import csv

b = numpy.zeros((m, n))
with open('a.csv', 'r') as f:
    reader = csv.reader(f)  # pass delimiter=' ' if the file came from numpy.savetxt's default format
    for i, row in enumerate(reader):
        b[i, :] = numpy.array(row, dtype=float)
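When the dimensions aren't known in advance, a cheap first pass over the file can count the rows before allocating. Here is a minimal sketch of that two-pass approach (load_preallocated is a hypothetical helper, not part of numpy):

import csv
import numpy

def load_preallocated(path, delimiter=','):
    # First pass: find the shape without holding any data in memory.
    with open(path, 'r') as f:
        reader = csv.reader(f, delimiter=delimiter)
        n = len(next(reader))
        m = 1 + sum(1 for _ in reader)
    # Second pass: fill a preallocated float array row by row.
    b = numpy.zeros((m, n))
    with open(path, 'r') as f:
        reader = csv.reader(f, delimiter=delimiter)
        for i, row in enumerate(reader):
            b[i, :] = numpy.array(row, dtype=float)
    return b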
asked Oct 27 '11 by Ian Langmore
1 Answer

Saving this array of floats to a text file creates a 24MB text file. When you re-load it, numpy goes through the file line by line, parsing the text and recreating the float objects.

I would expect memory usage to spike during the load, since numpy doesn't know how big the resulting array needs to be until it reaches the end of the file, so I'd expect at least 24MB + 8MB + other temporary memory to be in use.

Here's the relevant bit of the numpy code, from numpy/lib/npyio.py:

    # Parse each line, including the first
    for i, line in enumerate(itertools.chain([first_line], fh)):
        vals = split_line(line)
        if len(vals) == 0:
            continue
        if usecols:
            vals = [vals[i] for i in usecols]
        # Convert each value according to its column and store
        items = [conv(val) for (conv, val) in zip(converters, vals)]
        # Then pack it according to the dtype's nesting
        items = pack_items(items, packing)
        X.append(items)

    #...A bit further on
    X = np.array(X, dtype)
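Note that until that final np.array(X, dtype) call, X is a plain Python list of parsed rows, each made of Python float objects, which is far heavier than the final array. A rough, interpreter-dependent illustration of the difference:

import sys
import numpy

n = int(1e6)
values = [float(i) for i in range(n)]  # the kind of list loadtxt builds up

# The list stores pointers, and every Python float object carries its own overhead.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = numpy.array(values).nbytes  # 8 bytes per float64

print(list_bytes)   # roughly 30MB or more on CPython
print(array_bytes)  # 8000000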

This additional memory usage shouldn't be a concern: this is just how Python works. Although your Python process appears to be using 100MB of memory, internally it keeps track of which items are no longer used and will re-use that memory. For example, if you were to re-run this save-load procedure in the same program (save, load, save, load), your memory usage would not increase to 200MB.
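If you want to verify this, one way is to watch the process's peak memory across repeated save/load cycles and check that it plateaus after the first pass instead of growing. A quick sketch using the standard-library resource module (Unix only; note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

import resource
import numpy

a = numpy.zeros(int(1e6))
for i in range(4):
    numpy.savetxt('a.csv', a)
    b = numpy.loadtxt('a.csv')
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(i, peak)  # the peak should stabilize after the first iteration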

answered Sep 28 '22 by John Lyon