 

Is it possible to map discontinuous data on disk to an array with Python?

I want to map a big Fortran record (12 GB) on the hard disk to a numpy array. (Mapping instead of loading, to save memory.)

The data stored in the Fortran record is not contiguous, since it is divided by record markers. The record structure is "marker, data, marker, data, ..., data, marker". The lengths of the data regions and of the markers are known.

The length of the data between markers is not a multiple of 4 bytes; otherwise I could map each data region to its own array.

The first marker can be skipped by setting the offset in memmap. Is it possible to skip the other markers as well and map the data to a single array?

Apologies for any ambiguity, and thanks for any solution or suggestion.
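For reference, this is roughly what I mean by skipping the first marker with an offset (a minimal sketch; the file name, marker length and item count are placeholders):

import numpy as np

marker_bytes = 4           # assumed length of one record marker
n_items = (1024**3) * 3    # number of float32 values in the record

# Maps the data that starts right after the first marker, but it cannot
# skip the markers that appear in the middle of the record.
data = np.memmap('big_record.bin', dtype='float32', mode='r',
                 offset=marker_bytes, shape=(n_items,))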


Edited May 15

These are Fortran unformatted files. The data stored in the record is a (1024^3)*3 float32 array (12 GB).

The record layout of variable-length records that are greater than 2 gigabytes is shown below:

[Figure: layout of a variable-length record split into subrecords]

(For details see here -> the section [Record Types] -> [Variable-Length Records].)

In my case, every subrecord except the last has a length of 2147483639 bytes, and consecutive subrecords are separated by 8 bytes (as you see in the figure above: the end marker of the previous subrecord plus the begin marker of the following one, 8 bytes in total).

We can see that the first subrecord ends with the first 3 bytes of a certain float value and the second subrecord begins with its remaining 1 byte, since 2147483639 mod 4 = 3.
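To make the arithmetic concrete, here is a small sketch of the subrecord offsets this layout implies (assuming 4-byte markers; the numbers follow the description above):

SUB_LEN = 2147483639            # payload bytes per subrecord (except the last)
MARKER = 4                      # one begin/end marker is 4 bytes
TOTAL = (1024**3) * 3 * 4       # total payload: (1024^3)*3 float32 values

offsets = []                    # (file_offset, payload_length) of each subrecord
pos = MARKER                    # skip the begin marker of the first subrecord
remaining = TOTAL
while remaining > 0:
    length = min(SUB_LEN, remaining)
    offsets.append((pos, length))
    remaining -= length
    pos += length + 2 * MARKER  # end marker of this subrecord + begin marker of the next

# Because 2147483639 % 4 == 3, the payload boundaries do not fall on
# 4-byte boundaries, so a float32 memmap cannot start cleanly at every offset.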

Syrtis Major asked May 13 '13


1 Answer

I posted another answer because, for the example given here, numpy.memmap worked:

import numpy as np

offset = 0
data1 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=0, shape=(size1,))
offset += size1*byte_size          # move past the first data region
data2 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size2,))
offset += size2*byte_size          # move past the second data region
data3 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size3,))

For int32, byte_size = 32/8 = 4; for int16, byte_size = 16/8 = 2; and so forth...
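In the original question the data regions are separated by record markers, so the same offset bookkeeping would also have to step over them. A sketch of that idea (the 4-byte marker length, the file name and the sizes are assumptions, and each region length must be a multiple of the item size for the mapping to line up):

marker = 4                       # assumed length of one record marker
sizes = [1000, 2000, 500]        # hypothetical number of int32 items per region
byte_size = 4                    # int32

regions = []
offset = marker                  # skip the leading marker
for size in sizes:
    regions.append(np.memmap('tmp', dtype='i', mode='r+', order='F',
                             offset=offset, shape=(size,)))
    offset += size*byte_size + 2*marker   # skip end marker + next begin marker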

If the sizes are constant, you can map the data as a 2D array like:

shape = (total_length//size, size)
data = np.memmap('tmp', dtype='i', mode='r+', order='F', shape=shape)

You can modify the memmap object as much as you want. It is even possible to create arrays that share the same elements; in that case, changes made in one are automatically reflected in the other.
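For example (a quick illustration, assuming 'tmp' contains at least 100 int32 values):

a = np.memmap('tmp', dtype='i', mode='r+', shape=(100,))
b = a[10:20]     # a view sharing the same mapped memory
b[0] = 42        # also changes a[10]
a.flush()        # write the change back to the file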

Other references:

  • Working with big data in python and numpy, not enough ram, how to save partial results on disc?

  • numpy.memmap documentation here.

Saullo G. P. Castro answered Sep 17 '22