Batching a very large text file in Python

I am trying to batch a very large text file (approximately 150 gigabytes) into several smaller text files (approximately 10 gigabytes).

My general process will be:

# iterate over the file one line at a time
# accumulate the batch as a string
# accumulate a size count for the current batch
--> # when that count indicates the batch has reached the target size: (this is where I am unsure)
        # write to file

I have a rough metric to decide when to batch (when the accumulated batch reaches the desired size), but I am not so clear on how to decide how often to write to disk within a given batch. For example, if my batch size is 10 gigabytes, I assume I will need to write iteratively rather than hold the entire 10 gigabyte batch in memory. I obviously do not want to write more often than I have to, as this could be quite expensive.
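
For reference, here is a rough sketch of the shape I have in mind, writing each line out as it goes rather than holding a whole 10 gigabyte batch in memory (the file names and the threshold are placeholders):

batch_size = 10 * 1024**3          # target bytes per output file (placeholder)
outfile_template = "batch-{}.txt"  # placeholder output name

count = 0
written = 0
outfile = open(outfile_template.format(count), "w")

with open("large.txt") as infile:  # placeholder input name
    for line in infile:
        outfile.write(line)        # write each line as it arrives
        written += len(line)       # rough size metric for the current batch
        if written >= batch_size:  # batch is "full": rotate to a new file
            outfile.close()
            count += 1
            written = 0
            outfile = open(outfile_template.format(count), "w")

outfile.close()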

Do y'all have any rough calculations or tricks that you like to use to figure out when to write to disk for a task such as this, e.g. size vs. memory or something?

asked Feb 04 '26 by mzbaran

2 Answers

Assuming your large file is simple unstructured text (i.e. this is no good for structured text like JSON), here's an alternative to reading every single line: read large binary bites of the input file until you reach your chunksize, then read a couple of lines so the chunk ends on a line boundary, close the current output file, and move on to the next.

I compared this with line-by-line processing using @tdelaney's code, adapted to the same chunksize as mine. That code took 250 s to split a 12 GiB input file into 6 x 2 GiB chunks, whereas this took ~50 s, so maybe five times faster, and it looks I/O bound on my SSD, running at >200 MiB/s read and write, where the line-by-line version ran at 40-50 MiB/s read and write.

I turned buffering off because there's not a lot of point with reads and writes this large. The size of the bites and the buffering setting may be tunable to improve performance; I haven't tried any other settings because for me it seems to be I/O bound anyway.

import time

outfile_template = "outfile-{}.txt"
infile_name = "large.text"
chunksize = 2_000_000_000
MEB = 2**20   # mebibyte
bitesize = 4_000_000 # the size of the reads (and writes) working up to chunksize

count = 0

starttime = time.perf_counter()

infile = open(infile_name, "rb", buffering=0)
outfile = open(outfile_template.format(count), "wb", buffering=0)

while True:
    byteswritten = 0
    while byteswritten < chunksize:
        bite = infile.read(bitesize)
        # check for EOF
        if not bite:
            break
        outfile.write(bite)
        byteswritten += len(bite)
    # check for EOF
    if not bite:
        break
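    # read a couple of lines so the chunk ends on a line boundary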
    for i in range(2):
        l = infile.readline()
        # check for EOF
        if not l:
            break
        outfile.write(l)
    # check for EOF
    if not l:
        break
    outfile.close()
    count += 1
    print( count )
    outfile = open(outfile_template.format(count), "wb", buffering=0)

outfile.close()
infile.close()

endtime = time.perf_counter()

elapsed = endtime-starttime

print( f"Elapsed= {elapsed}" )

NOTE: I haven't exhaustively tested that this doesn't lose data. Although there is no evidence that it loses anything, you should validate that yourself.

It might be useful to add some robustness by checking, when you reach the end of a chunk, how much data is left to read, so you don't end up with the last output file being 0-length (or shorter than bitesize).
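
A rough sketch of that check, reusing the names from the code above (untested, and exactly where you put it in the loop is up to you):

import os

# after finishing a chunk, see how many bytes remain in the input
remaining = os.path.getsize(infile_name) - infile.tell()
if remaining and remaining < bitesize:
    # too little left to justify a new output file: append the tail
    # to the current chunk before closing it
    outfile.write(infile.read())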

HTH barny

answered Feb 06 '26 by DisappointedByUnaccountableMod


I used a slightly modified version of this for parsing a 250 GB JSON file. I choose how many smaller files I need (number_of_slices), then I find the positions at which to slice the file (I always look for a line end). Finally, I slice the file with file.seek and file.read(chunk):

import os
import mmap


FULL_PATH_TO_FILE = 'full_path_to_a_big_file'
OUTPUT_PATH = 'full_path_to_a_output_dir' # where sliced files will be generated


def next_newline_finder(mmapf):
    # scan forward one byte at a time until a line end is found,
    # then return the offset just past it
    while True:
        current = mmapf.read_byte()
        if current == ord('\n'):  # or whatever line-end symbol
            return mmapf.tell()


# find positions where to slice a file
file_info = os.stat(FULL_PATH_TO_FILE)
file_size = file_info.st_size
positions_for_file_slice = [0]
number_of_slices = 15  # say you want to slice the big file into 15 smaller files
size_per_slice = file_size//number_of_slices

with open(FULL_PATH_TO_FILE, "r+b") as f:
    mmapf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    slice_counter = 1
    while slice_counter < number_of_slices:
        pos = size_per_slice*slice_counter
        mmapf.seek(pos)
        newline_pos = next_newline_finder(mmapf)
        positions_for_file_slice.append(newline_pos)
        slice_counter += 1

# create ranges for found positions (from, to)
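# e.g. [0, p1, p2] -> [(0, p1), (p1, p2), (p2, file_size)]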
positions_for_file_slice = [(pos, positions_for_file_slice[i+1]) if i < (len(positions_for_file_slice)-1) else (
    positions_for_file_slice[i], file_size) for i, pos in enumerate(positions_for_file_slice)]


# do actual slice of a file
with open(FULL_PATH_TO_FILE, "rb") as f:
    for i, position_pair in enumerate(positions_for_file_slice):
        read_from, read_to = position_pair
        f.seek(read_from)
        chunk = f.read(read_to-read_from)
        with open(os.path.join(OUTPUT_PATH, f'dummyfile{i}.json'), 'wb') as chunk_file:
            chunk_file.write(chunk)

answered Feb 06 '26 by Jiří Oujezdský