
How to read a big binary file and split its content by some marker

Tags: python

In Python, reading a big text file line-by-line is simple:

for line in open('somefile', 'r'): ...

But how can I read a binary file and split its content lazily (with a generator) on a given marker instead of the newline '\n'?

I want something like this:

content = open('somefile', 'rb').read()
result = content.split(b'some_marker')

but memory-efficient, since the file is around 70 GB. Reading the file one byte at a time is not an option either: it would be far too slow on a hard disk.

The length of the chunks (the data between the markers) varies, in theory from 1 byte to several megabytes.

To sum up with an example, the data looks like this (the digits stand for bytes; the data is in a binary format):

12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...

Is there a simple way to do that, without implementing the chunked reading, the splitting of chunks, and the bookkeeping of leftover tails by hand?

Spaceman asked Sep 15 '13 06:09


1 Answer

There is no magic in Python that will do it for you, but it's not hard to write. For example:

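def split_file(fp, marker):
    """Return the list of byte chunks in fp that are separated by marker."""
    BLOCKSIZE = 4096
    result = []
    current = b''
    # fp must be opened in binary mode; read() returns b'' at end of file.
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        # Split off every complete chunk currently in the buffer. A marker
        # that straddles two blocks is found once the next block is appended.
        while True:
            markerpos = current.find(marker)
            if markerpos == -1:
                break
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    # Whatever follows the last marker is the final chunk.
    result.append(current)
    return result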

As written, the function still collects every chunk in a list. Its memory usage can be reduced further by turning it into a generator, i.e. by replacing result.append(...) with yield .... This is left as an exercise to the reader.
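For reference, one possible way to do that conversion is sketched below. It is not part of the original answer: the name iter_chunks and the blocksize parameter are made up for illustration, and it assumes the file object is opened in binary mode and the marker is a bytes object.

```python
import io


def iter_chunks(fp, marker, blocksize=4096):
    """Lazily yield the byte strings in fp that are separated by marker.

    Hypothetical generator variant of split_file; fp must be a binary
    file object and marker a non-empty bytes object.
    """
    current = b''
    # read() returns b'' at end of file, which stops the iteration.
    for block in iter(lambda: fp.read(blocksize), b''):
        current += block
        start = 0
        while True:
            markerpos = current.find(marker, start)
            if markerpos == -1:
                break
            yield current[start:markerpos]
            start = markerpos + len(marker)
        # Keep only the unfinished tail; it may end with a partial marker.
        current = current[start:]
    # Whatever follows the last marker is the final chunk.
    yield current


# Small in-memory stand-in for the real file. The block size is smaller
# than the marker on purpose, so markers straddle block boundaries.
sample = io.BytesIO(b'12345223-MARKER-3492-MARKER-888')
chunks = list(iter_chunks(sample, b'-MARKER-', blocksize=5))
```

With a real file this would be used as for chunk in iter_chunks(open('somefile', 'rb'), b'some_marker'): .... Only the current buffer is held in memory, so peak usage is roughly the largest chunk plus one block. For a 70 GB file a larger block size than 4096 is probably worth trying, but that is a guess to be measured, not a tested recommendation.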

user4815162342 answered Sep 23 '22 21:09