
Time performance in generating a very large text file in Python

I need to generate a very large text file. Each line has a simple format:

Seq_num<SPACE>num_val
12343234 759

Let's assume I am going to generate a file with 100 million lines. I tried 2 approaches and surprisingly their time performance is very different.

  1. For loop over 100m. In each iteration I make a short string of seq_num<SPACE>num_val and then write it to the file. This approach takes a lot of time.

    ## APPROACH 1
    for seq_id in seq_ids:
        num_val = rand()
        line = str(seq_id) + ' ' + str(num_val) + '\n'
        data_file.write(line)
    
  2. For loop over 100m. In each iteration I make a short string of seq_num<SPACE>num_val and append it to a list. When the loop finishes, I iterate over the list items and write each item to the file. This approach takes far less time.

    ## APPROACH 2
    data_lines = list()
    for seq_id in seq_ids:
        num_val = rand()
        line = str(seq_id) + ' ' + str(num_val) + '\n'
        data_lines.append(line)
    for line in data_lines:
        data_file.write(line)
    

Note that:

  • Approach 2 has 2 loops instead of 1 loop.
  • I write to the file in a loop in both approach 1 and approach 2, so this step should take the same time in both.

So approach 1 should take less time. Any hints on what I am missing?

asked Mar 13 '18 by doubleE


1 Answer

"A lot" and "far less" are technically very vague terms :) Basically, if you can't measure it, you can't improve it.

For simplicity, let's start with a simple benchmark, loop1.py:

import random
from datetime import datetime

start = datetime.now()
data_file = open('file.txt', 'w')
for seq_id in range(0, 1000000):
    num_val = random.random()
    line = "%i %f\n" % (seq_id, num_val)
    data_file.write(line)

end = datetime.now()
print("elapsed time %s" % (end - start))

loop2.py with 2 for loops:

import random
from datetime import datetime

start = datetime.now()
data_file = open('file.txt', 'w')
data_lines = list()
for seq_id in range(0, 1000000):
    num_val = random.random()
    line = "%i %f\n" % (seq_id, num_val)
    data_lines.append(line)
for line in data_lines:
    data_file.write(line)

end = datetime.now()
print("elapsed time %s" % (end - start))

When I run these two scripts on my computer (with an SSD drive) I get something like:

$ python3 loop1.py 
elapsed time 0:00:00.684282
$ python3 loop2.py 
elapsed time 0:00:00.766182

Each measurement might vary slightly, but as intuition would suggest, the second one is slightly slower.

If we want to optimize writing time, we need to check the manual for how Python implements writing to files. For text files the open() function should use a BufferedWriter. The open() function accepts a third argument, which is the buffer size. Here's the interesting part:

Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
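
As a quick check of these values on a Unix-like system (this snippet is my own, not part of the original benchmarks):

import io
import os

# Fallback chunk size Python uses when it can't detect the device's block size
print(io.DEFAULT_BUFFER_SIZE)        # typically 8192

# Block size reported by the filesystem for the current directory (Unix-only)
print(os.statvfs('.').f_bsize)       # often 4096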

So, we can modify loop1.py (saved as loop3.py) and use line buffering:

data_file = open('file.txt', 'w', 1)

This turns out to be very slow:

$ python3 loop3.py 
elapsed time 0:00:02.470757

In order to optimize writing time, we can adjust the buffer size to our needs. First we check the line size in bytes with len(line.encode('utf-8')), which gives me 11 bytes.
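
For example, with seq_id = 0 (a made-up value just for the check):

line = "%i %f\n" % (0, 0.123456)
print(len(line.encode('utf-8')))     # 11 bytes; larger seq_id values give longer lines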

After updating the buffer size to our expected line size in bytes:

data_file = open('file.txt', 'w', 11)

I'm getting quite fast writes:

elapsed time 0:00:00.669622

Based on the details you've provided it's hard to estimate what's going on. Maybe the heuristic for estimating the block size doesn't work well on your computer. Anyway, if you're writing lines of fixed length, it's easy to optimize the buffer size. You could further optimize writing to files by leveraging flush().
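
If you want to take explicit control, here's one possible sketch (mine, not from the benchmarks above; the 8192-byte chunk size is an assumption) that accumulates lines and writes them out roughly one block at a time, flushing after each chunk:

import random

BLOCK_SIZE = 8192                          # assumed block size; adjust to your filesystem

with open('file.txt', 'w') as data_file:
    chunk = []
    chunk_bytes = 0
    for seq_id in range(1000000):
        line = "%i %f\n" % (seq_id, random.random())
        chunk.append(line)
        chunk_bytes += len(line)
        if chunk_bytes >= BLOCK_SIZE:
            data_file.write(''.join(chunk))    # hand a block-sized chunk to the file object
            data_file.flush()                  # push it to the OS right away
            chunk, chunk_bytes = [], 0
    if chunk:
        data_file.write(''.join(chunk))        # write whatever is left over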

Conclusion: Generally, for faster writes into a file you should write data in bulk chunks that correspond to a block size on your file system - which is exactly what Python's open('file.txt', 'w') is trying to do. In most cases you're safe with the defaults; differences in microbenchmarks are insignificant.

You're allocating a large number of string objects that need to be collected by the GC. As suggested by @kevmo314, in order to perform a fair comparison you should disable the GC for loop1.py:

gc.disable()

The GC might try to remove string objects during the loop (you're not keeping any references), while the second approach keeps references to all string objects and the GC collects them at the end.
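
To check this yourself, you could re-run the first benchmark with the collector switched off (a sketch based on loop1.py above; only the gc calls are new):

import gc
import random
from datetime import datetime

gc.disable()                            # take the garbage collector out of the measurement
start = datetime.now()
with open('file.txt', 'w') as data_file:
    for seq_id in range(0, 1000000):
        data_file.write("%i %f\n" % (seq_id, random.random()))
end = datetime.now()
gc.enable()
print("elapsed time %s" % (end - start))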

answered Oct 24 '22 by Tombart