 

Floating Point Exception with Numpy and PyTables

I have a rather large HDF5 file generated by PyTables that I am attempting to read on a cluster. I am running into a problem with NumPy as I read in an individual chunk. Let's go with the example:

The total shape of the array in the HDF5 file is:

In [13]: data.shape
Out[13]: (21933063, 800, 3)

Each entry in this array is a np.float64.

Each node reads a slice of shape (21933063,10,3). Unfortunately, NumPy seems unable to read all 21 million rows at once. I have tried to do this sequentially by splitting each slice into 10 subslices of shape (2193306,10,3) and joining them with the following reduce:

In [8]: a = reduce(lambda x,y : np.append(x,y,axis=0), [np.array(data[i*      \
        chunksize: (i+1)*chunksize,:10],dtype=np.float64) for i in xrange(k)])
In [9]:

where 1 <= k <= 10 and chunksize = 2193306. This code works for k <= 9; otherwise I get the following:

In [8]: a = reduce(lambda x,y : np.append(x,y,axis=0), [np.array(data[i*      \
        chunksize: (i+1)*chunksize,:10],dtype=np.float64) for i in xrange(k)])
Floating point exception
home@mybox  00:00:00  ~
$

I tried using Valgrind's memcheck tool to figure out what is going on and it seems as if PyTables is the culprit. The two main files that show up in the trace are libhdf5.so.6 and a file related to blosc.

Also, note that if I have k=8, I get:

In [12]: a.shape
Out[12]: (17546448, 10, 3)

But if I append the last subslice, I get:

In [14]: a = np.append(a,np.array(data[8*chunksize:9*chunksize,:10],   \
         dtype=np.float64))
In [15]: a.shape
Out[15]: (592192620,)
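A note on that last shape: the call in In [14] omits axis=0, and np.append flattens both inputs when no axis is given. That would account for the 1-D result, since 17546448*30 + 2193306*30 = 592192620. A minimal sketch with small stand-in arrays (the arrays below are illustrative, not the original data):

```python
import numpy as np

a = np.zeros((8, 10, 3))   # stand-in for the k=8 result
b = np.ones((1, 10, 3))    # stand-in for one more subslice

flat = np.append(a, b)             # no axis: both inputs are flattened first
stacked = np.append(a, b, axis=0)  # axis=0: rows are appended, shape is kept

print(flat.shape)     # (270,)
print(stacked.shape)  # (9, 10, 3)
```

This only explains the flattened shape; it says nothing about the floating point exception itself.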

Does anyone have any ideas of what to do? Thanks!

asked Sep 30 '11 23:09 by Tarun Chitra


1 Answer

Did you try to allocate such a big array before (like DaveP suggests)?

In [16]: N.empty((21000000,800,3))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: array is too big.

This is on 32-bit Python. You would actually need about 20e6*800*3*8/1e9 = 384 GB of memory, since one float64 takes 8 bytes. Do you really need the whole array at once?
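The arithmetic above can be checked with the exact shapes from the question. This is a rough sketch; the helper name array_gbytes is made up for illustration, and GB here means 1e9 bytes:

```python
import numpy as np

def array_gbytes(shape, dtype=np.float64):
    """Return the size in GB (1e9 bytes) of a dense array of this shape."""
    n_items = 1
    for dim in shape:
        n_items *= dim  # plain Python ints, so no overflow
    return n_items * np.dtype(dtype).itemsize / 1e9

full = array_gbytes((21933063, 800, 3))     # the whole dataset
per_node = array_gbytes((21933063, 10, 3))  # one node's slice

print(round(full, 1))      # 421.1
print(round(per_node, 2))  # 5.26
```

So even a single node's slice is a bit over 5 GB, before counting any temporary copies made while appending.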

Sorry, did not read post properly.

Your array with k=8 subslices is already about 4.2 GB (17546448*10*3*8 bytes). Maybe that is the problem?

Does it work if you use only 8 instead of 10 for the last dimension?

Another suggestion: I would first resize the array, then fill it:

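from numpy import zeros, ones, resize

a = zeros((4,8,3))        # start with the first block
a = resize(a, (8,8,3))    # grow to the final shape (returns a new array)
a[4:] = ones((4,8,3))     # fill in the second block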
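A variation on the same idea is to allocate the final array once and copy each chunk into place, which avoids the repeated full copies that np.append makes. The sketch below is only an illustration: read_in_chunks is a hypothetical helper, and a small in-memory array stands in for the PyTables node, assuming it supports the same slicing. It is written for Python 3 (range rather than xrange).

```python
import numpy as np

def read_in_chunks(data, n_rows, chunksize, n_cols=10):
    """Copy data[:n_rows, :n_cols] into one preallocated array, chunk by chunk."""
    out = np.empty((n_rows, n_cols) + tuple(data.shape[2:]), dtype=np.float64)
    for start in range(0, n_rows, chunksize):
        stop = min(start + chunksize, n_rows)  # the last chunk may be shorter
        out[start:stop] = data[start:stop, :n_cols]
    return out

# Small in-memory stand-in for the HDF5 array (hypothetical data).
data = np.arange(25 * 800 * 3, dtype=np.float64).reshape(25, 800, 3)
a = read_in_chunks(data, n_rows=25, chunksize=10)
print(a.shape)  # (25, 10, 3)
```

Note that 10 * 2193306 = 21933060, three rows short of 21933063, so the shorter final chunk matters for the real data. If the crash originates inside HDF5 or blosc, this will not necessarily avoid it, but it keeps peak memory close to the size of the final array.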
answered Nov 19 '22 02:11 by mrossi