How to read a big binary file and split its content by some marker

Question

In Python, reading a big text file line-by-line is simple:

for line in open('somefile', 'r'): ...

But how can I read a binary file and split its content lazily (with a generator) on some given marker instead of the newline '\n'?

I want something like this:

content = open('somefile', 'rb').read()
result = content.split(b'some_marker')

but memory-efficient, since the file is around 70GB. Reading the file one byte at a time is not an option either: it would be too slow on an HDD.

The length of the chunks (the data between the markers) may vary, in theory from 1 byte to megabytes.

To sum up with an example, the data looks like this (digits stand for bytes here; the data is in a binary format):

12345223-MARKER-3492-MARKER-34834983428623762374632784-MARKER-888-MARKER-...

Is there any simple way to do that (not implementing reading in chunks, splitting the chunks, remembering tails etc.)?

user4815162342 · Accepted Answer

There is no magic in Python that will do it for you, but it's not hard to write. For example:

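def split_file(fp, marker):
    # fp must be opened in binary mode ('rb'); marker must be a bytes object.
    BLOCKSIZE = 4096
    result = []
    current = b''
    # Read fixed-size blocks until read() returns b'' at end of file.
    for block in iter(lambda: fp.read(BLOCKSIZE), b''):
        current += block
        while True:
            markerpos = current.find(marker)
            if markerpos == -1:
                break  # no complete marker buffered yet; read more
            result.append(current[:markerpos])
            current = current[markerpos + len(marker):]
    # Whatever follows the last marker is the final chunk.
    result.append(current)
    return result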

The memory usage of this function can be reduced further by turning it into a generator, i.e. by converting result.append(...) to yield .... This is left as an exercise to the reader.
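
For reference, one possible shape of that generator is sketched below. It is only an illustration under a few assumptions: the name iter_split, the blocksize parameter and its 64 KiB default are made up here, the file is assumed to be opened in binary mode, and the marker is assumed to be a non-empty bytes object. It also resumes each search just before the newly read data, so long chunks are not rescanned from the start on every block.

```python
import io


def iter_split(fp, marker, blocksize=64 * 1024):
    """Yield the byte strings found between occurrences of marker in fp."""
    if not marker:
        raise ValueError("marker must not be empty")
    current = b""
    for block in iter(lambda: fp.read(blocksize), b""):
        # The buffered data is known to hold no complete marker, so a new
        # match can start at most len(marker) - 1 bytes before the new block.
        search_from = max(0, len(current) - len(marker) + 1)
        current += block
        while True:
            pos = current.find(marker, search_from)
            if pos == -1:
                break  # need more data before the next chunk is complete
            yield current[:pos]
            current = current[pos + len(marker):]
            search_from = 0  # the remainder has not been searched yet
    # Data after the last marker (possibly empty) is the final chunk.
    yield current


# Small in-memory demonstration; a real call would pass open(path, 'rb').
sample = b"12345223-MARKER-3492-MARKER-888"
chunks = list(iter_split(io.BytesIO(sample), b"-MARKER-", blocksize=5))
```

Only the chunk currently being assembled is held in memory, so peak usage is roughly the size of the largest chunk plus one block, rather than the size of the whole file.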

Tags:

python

Spaceman

1 Answers

user4815162342
