 

Fast method in Python to split a large text file using number of lines as input variable

I am splitting a text file using the number of lines as a variable. I wrote this function to save the split files in a temporary directory. Each file has 4 million lines except the last one.

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

The main problem is the speed of this function. Splitting one file of 8 million lines into two files of 4 million lines each takes more than 30 minutes on my Windows OS with Python 2.7.

Gianni Spear asked Feb 17 '23

1 Answer

       for line in group:
            with open(output_name, 'a') as outfile:
                outfile.write(line)

is opening the file, and writing one line, for each line in group. This is slow.

Instead, write once per group.

            with open(output_name, 'a') as outfile:
                outfile.write(''.join(group))
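Putting the two fixes together, a minimal sketch of the whole function might look like the following (the name `tempfile_split_fast` is hypothetical; it keeps the question's `groupby`/`count` chunking but opens each output file only once and writes the group in a single call):

```python
import os
from itertools import groupby, count

def tempfile_split_fast(filename, temp_dir, chunk=4000000):
    # Assign each line a running index via count(); integer-dividing
    # by `chunk` groups consecutive lines into blocks of `chunk` lines.
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda line, c=count(): next(c) // chunk)
        for k, group in groups:
            output_name = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # Open once per group and write the whole block in one call,
            # instead of reopening the file for every single line.
            with open(output_name, 'w') as outfile:
                outfile.write(''.join(group))
```

Note that `''.join(group)` materializes an entire chunk (up to 4 million lines) in memory at once; if that is a concern, `outfile.writelines(group)` streams the group line by line while still opening the file only once.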
unutbu answered Mar 11 '23