Preferred block size when reading/writing big binary files

I need to read and write huge binary files. Is there a preferred or even optimal number of bytes (what I call BLOCK_SIZE) I should read() at a time?

One byte is certainly too little, and I do not think reading 4 GB into the RAM is a good idea either - is there a 'best' block size? or does that even depend on the file-system (I'm on ext4)? What do I need to consider?

Python's open() even provides a buffering argument. Would I need to tweak that as well?

Here is sample code that just joins the two files in-0.data and in-1.data into out.data (in real life there is more processing that is irrelevant to the question at hand). BLOCK_SIZE is chosen equal to io.DEFAULT_BUFFER_SIZE, which seems to be the default for buffering:

from pathlib import Path
from functools import partial

DATA_PATH = Path(__file__).parent / '../data/'

out_path = DATA_PATH / 'out.data'
in_paths = (DATA_PATH / 'in-0.data', DATA_PATH / 'in-1.data')

BLOCK_SIZE = 8192  # == io.DEFAULT_BUFFER_SIZE on CPython

def process(data):
    pass

with out_path.open('wb') as out_file:
    for in_path in in_paths:
        with in_path.open('rb') as in_file:
            for data in iter(partial(in_file.read, BLOCK_SIZE), b''):
                process(data)
                out_file.write(data)
#            while True:
#                data = in_file.read(BLOCK_SIZE)
#                if not data:
#                    break
#                process(data)
#                out_file.write(data)
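For reference, a small sketch of where such numbers come from. io.DEFAULT_BUFFER_SIZE is CPython's built-in default, and on POSIX systems os.stat() exposes the filesystem's preferred I/O block size as st_blksize; reading in multiples of that value is a common heuristic (the '.' path here is just an arbitrary example):

```python
import io
import os

# CPython's default buffer size, used by open() when buffering is not given.
print(io.DEFAULT_BUFFER_SIZE)

# The filesystem's preferred I/O block size for this path (POSIX-only field);
# reading in multiples of st_blksize is a common block-size heuristic.
print(os.stat('.').st_blksize)
```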
asked Sep 23 '15 by hiro protagonist
1 Answer

Let the OS make the decision for you. Use the mmap module:

https://docs.python.org/3/library/mmap.html

It uses your OS's underlying memory-mapping mechanism to map the contents of a file into the process's address space.

Be aware that there's a 2GB file size limit if you're using 32-bit Python, so be sure to use the 64-bit version if you decide to go this route.

For example:

import mmap

f1 = open('input_file', 'r+b')
m1 = mmap.mmap(f1.fileno(), 0)
f2 = open('out_file', 'a+b')  # out_file must be > 0 bytes on Windows
m2 = mmap.mmap(f2.fileno(), 0)
m2.resize(len(m1))
m2[:] = m1[:]  # copy input_file to out_file (slice assignment needs a bytes-like)
m2.flush()     # flush results to disk

Note that you never had to call any read() function or decide how many bytes to bring into RAM. This example just copies one file into another, but as you said, you can insert whatever processing you need in between. Note that while the entire file is mapped into your address space, that doesn't mean it has actually been copied into RAM: it is paged in piecewise, at the discretion of the OS.

answered Sep 22 '22 by Chad Kennedy