Any efficient way to read data from a large binary file?

I need to handle tens of gigabytes of data in one binary file. Each record in the file is variable length.

So the file layout is:

<len1><data1><len2><data2>..........<lenN><dataN>

The data contains integers, pointers, double values, and so on.

I found that Python can barely handle this. Reading the whole file into memory is no problem, and it's fast, but the struct module's performance seems poor: it practically stalls while unpacking the bytes.
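
For concreteness, here is a stripped-down sketch of the kind of sequential reader I mean (the 4-byte little-endian length prefix is just for illustration; my real format strings are longer):

    import struct

    # Illustrative assumption: each record is prefixed by a 4-byte
    # little-endian unsigned length; adjust '<I' to match the real file.
    LEN = struct.Struct('<I')

    def iter_records(path):
        """Yield each record's raw bytes, reading sequentially."""
        with open(path, 'rb', buffering=1 << 20) as f:
            while True:
                header = f.read(LEN.size)
                if not header:
                    break  # clean end of file
                (length,) = LEN.unpack(header)
                payload = f.read(length)
                if len(payload) != length:
                    raise IOError('truncated record at end of file')
                yield payload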

Any help is appreciated.

Thanks.

asked by limi

1 Answer

struct and array, which other answers recommend, are fine for the details of the implementation, and may be all you need if you always read the whole file (or a prefix of it) sequentially. Other options include buffer, mmap, even ctypes, depending on many details you don't mention about your exact needs. Perhaps a small specialized Cython-coded helper can offer all the extra performance you need, if no suitable and accessible library (in C, C++, Fortran, ...) already exists that can be interfaced to handle this humongous file the way you need.
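
For instance, here is a minimal sketch of the mmap route -- assuming, since you don't say, a 4-byte little-endian length prefix -- where a precompiled struct.Struct and unpack_from decode each header in place, with no per-record read() call:

    import mmap
    import struct

    # Assumed 4-byte little-endian length prefix (the question doesn't
    # specify the width); unpack_from reads straight out of the mapping.
    LEN = struct.Struct('<I')

    def iter_records_mmap(path):
        with open(path, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                pos, end = 0, len(mm)
                while pos < end:
                    (length,) = LEN.unpack_from(mm, pos)
                    pos += LEN.size
                    yield mm[pos:pos + length]  # copies just this record
                    pos += length
            finally:
                mm.close()

On a 64-bit box the whole file maps comfortably and the OS pages it in on demand, so a sequential scan costs little more than the raw I/O.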

But clearly there are peculiar issues here -- how can a data file contain pointers, for example, which are intrinsically a concept related to addressing memory? Are they perhaps "offsets" instead, and, if so, exactly how are they based and coded? Are your needs at all more advanced than plain sequential reading (e.g., random access)? If so, can you do a first "indexing" pass to get all the offsets from start of file to start of record into a more usable, compact, handily formatted auxiliary file? (That binary file of offsets would be a natural fit for array -- unless the offsets need to be longer than array supports on your machine!) What are the distribution of record lengths, the record compositions, and the number of records that make up the "tens of gigabytes"? Etc, etc.
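
As a sketch of that indexing pass (again assuming the hypothetical 4-byte length prefix), one sequential scan can collect every record's start offset into an array of 64-bit offsets -- typecode 'q' sidesteps the size limit just mentioned -- and dump it to an auxiliary file for later random access:

    import struct
    from array import array

    LEN = struct.Struct('<I')  # same assumed 4-byte length prefix

    def build_index(data_path, index_path):
        """One sequential pass: store each record's start offset."""
        offsets = array('q')  # 64-bit signed, so huge files still fit
        with open(data_path, 'rb') as f:
            pos = 0
            while True:
                header = f.read(LEN.size)
                if not header:
                    break
                offsets.append(pos)
                (length,) = LEN.unpack(header)
                pos += LEN.size + length
                f.seek(pos)  # skip the payload without reading it
        with open(index_path, 'wb') as out:
            offsets.tofile(out)
        return offsets

    def read_record(f, offsets, i):
        """Random access to record i via the in-memory index."""
        f.seek(offsets[i])
        (length,) = LEN.unpack(f.read(LEN.size))
        return f.read(length)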

You have a very large scale problem (and no doubt very large scale hardware to support it: since you mention that you can easily read the whole file into memory, that means a 64-bit box with many tens of GB of RAM -- wow!), so it's well worth taking detailed care to optimize its handling -- but we can't help much with that detailed care unless we know enough detail to do so!-).

answered by Alex Martelli