Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort a big file with Python heapq.merge

Tags:

python

sorting

I'm looking to complete such job but have encountered difficulty:

I have a huge file of texts. Each line is of the format "AGTCCCGGAT filename" where the first part is a DNA thing.

The professor suggests that we break this huge file into many temporary files and use heapq.merge() to sort them. The goal is to have one file at the end which contains every line of the original file and is sorted.

My first try was to break each line into a separate temporary file. The problem is that heapq.merge() reports there are too many files to sort.

My second try was to break it into temporary files by 50000 lines. The problem is that it seems that it does not sort by line, but by file. for example, we have something like:

ACGTACGT filename
CGTACGTA filename
ACGTCCGT filename
CGTAAAAA filename

where the first two lines are from one temp file and the last two lines are from the second file.

My code to sort them is as follows:

for line in heapq.merge(*[open('/var/tmp/L._Ipsum-strain01.fa_dir/'+str(f),'r') for f in os.listdir('/var/tmp/L._Ipsum-strain01.fa_dir')]):
     result.write(line)
result.close()
like image 976
Destino Avatar asked Mar 20 '23 17:03

Destino


1 Answers

Your solution is almost correct. However, each partial file must be sorted before you write them to the disk. Here's a 2-pass algorithm that demonstrates it: First, iterate the file in 50k line chunks, sort the lines in a chunk and then write this sorted chunk into a file. In second pass, open all these files and merge to the output file.

from heapq import merge
from itertools import count, islice
from contextlib import ExitStack  # not available on Python 2
                                  # need to care for closing files otherwise

chunk_names = []

# chunk and sort
with open('input.txt') as input_file:
    for chunk_number in count(1):
        # read in next 50k lines and sort them
        sorted_chunk = sorted(islice(input_file, 50000))
        if not sorted_chunk:
            # end of input
            break

        chunk_name = 'chunk_{}.chk'.format(chunk_number)
        chunk_names.append(chunk_name)
        with open(chunk_name, 'w') as chunk_file:
            chunk_file.writelines(sorted_chunk)

with ExitStack() as stack, open('output.txt', 'w') as output_file:
    files = [stack.enter_context(open(chunk)) for chunk in chunk_names]
    output_file.writelines(merge(*files))
like image 126