
Time performance in generating a very large text file in Python

I need to generate a very large text file. Each line has a simple format:

Seq_num<SPACE>num_val
12343234 759

Let's assume I am going to generate a file with 100 million lines. I tried 2 approaches and surprisingly their time performance is very different.

  1. For loop over 100m. In each iteration I make a short string of seq_num<SPACE>num_val and then write it to the file. This approach takes a lot of time.

    ## APPROACH 1
    for seq_id in seq_ids:
        num_val = rand()
        line = str(seq_id) + ' ' + str(num_val) + '\n'
        data_file.write(line)
    
  2. For loop over 100m. In each iteration I make a short string of seq_num<SPACE>num_val and append it to a list. When the loop finishes, I iterate over the list items and write each item to the file. This approach takes far less time.

    ## APPROACH 2
    data_lines = list()
    for seq_id in seq_ids:
        num_val = rand()
        line = str(seq_id) + ' ' + str(num_val) + '\n'
        data_lines.append(line)
    for line in data_lines:
        data_file.write(line)
    

Note that:

  • Approach 2 has 2 loops instead of 1 loop.
  • I write to the file in a loop in both approach 1 and approach 2, so this step should take the same time in both.

So approach 1 should take less time. Any hints on what I am missing?

asked Mar 13 '18 by doubleE


1 Answer

"A lot" and "far less" are technically very vague terms :) Basically, if you can't measure it, you can't improve it.

For simplicity, let's start with a simple benchmark, loop1.py:

import random
from datetime import datetime

start = datetime.now()
data_file = open('file.txt', 'w')
for seq_id in range(0, 1000000):
    num_val = random.random()
    line = "%i %f\n" % (seq_id, num_val)
    data_file.write(line)

end = datetime.now()
print("elapsed time %s" % (end - start))

loop2.py with 2 for loops:

import random
from datetime import datetime

start = datetime.now()
data_file = open('file.txt', 'w')
data_lines = list()
for seq_id in range(0, 1000000):
    num_val = random.random()
    line = "%i %f\n" % (seq_id, num_val)
    data_lines.append(line)
for line in data_lines:
    data_file.write(line)

end = datetime.now()
print("elapsed time %s" % (end - start))

When I run these two scripts on my computer (with an SSD drive) I get something like:

$ python3 loop1.py 
elapsed time 0:00:00.684282
$ python3 loop2.py 
elapsed time 0:00:00.766182

Each measurement might vary slightly, but as intuition would suggest, the second one is slightly slower.

If we want to optimize writing time, we need to check the manual for how Python implements writing to files. For text files the open() function should use a BufferedWriter. The open() function accepts a third argument, which is the buffer size. Here's the interesting part:

Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
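
As a quick check of these values on a Unix-like system (this snippet is my own, not part of the original benchmarks):

import io
import os

# Fallback chunk size Python uses when it can't detect the device's block size
print(io.DEFAULT_BUFFER_SIZE)        # typically 8192

# Block size reported by the filesystem for the current directory (Unix-only)
print(os.statvfs('.').f_bsize)       # often 4096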

So, we can modify loop1.py (saved as loop3.py) and use line buffering:

data_file = open('file.txt', 'w', 1)

This turns out to be very slow:

$ python3 loop3.py 
elapsed time 0:00:02.470757

In order to optimize writing time, we can adjust the buffer size to our needs. First we check the line size in bytes with len(line.encode('utf-8')), which gives me 11 bytes.
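
For example, with seq_id = 0 (a made-up value just for the check):

line = "%i %f\n" % (0, 0.123456)
print(len(line.encode('utf-8')))     # 11 bytes; larger seq_id values give longer lines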

After updating the buffer size to our expected line size in bytes:

data_file = open('file.txt', 'w', 11)

I'm getting quite fast writes:

elapsed time 0:00:00.669622

Based on the details you've provided it's hard to estimate what's going on. Maybe the heuristic for estimating the block size doesn't work well on your computer. Anyway, if you're writing lines of fixed length, it's easy to optimize the buffer size. You could further optimize writing to files by leveraging flush().
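
If you want to take explicit control, here's one possible sketch (mine, not from the benchmarks above; the 8192-byte chunk size is an assumption) that accumulates lines and writes them out roughly one block at a time, flushing after each chunk:

import random

BLOCK_SIZE = 8192                          # assumed block size; adjust to your filesystem

with open('file.txt', 'w') as data_file:
    chunk = []
    chunk_bytes = 0
    for seq_id in range(1000000):
        line = "%i %f\n" % (seq_id, random.random())
        chunk.append(line)
        chunk_bytes += len(line)
        if chunk_bytes >= BLOCK_SIZE:
            data_file.write(''.join(chunk))    # hand a block-sized chunk to the file object
            data_file.flush()                  # push it to the OS right away
            chunk, chunk_bytes = [], 0
    if chunk:
        data_file.write(''.join(chunk))        # write whatever is left over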

Conclusion: Generally, for faster writes into a file you should write data in bulk chunks that correspond to a block size on your file system - which is exactly what Python's open('file.txt', 'w') is trying to do. In most cases you're safe with the defaults; differences in microbenchmarks are insignificant.

You're allocating a large number of string objects that need to be collected by the GC. As suggested by @kevmo314, in order to perform a fair comparison you should disable the GC for loop1.py:

gc.disable()

The GC might try to remove string objects during the loop (you're not keeping any references), while the second approach keeps references to all string objects and the GC collects them at the end.
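
To check this yourself, you could re-run the first benchmark with the collector switched off (a sketch based on loop1.py above; only the gc calls are new):

import gc
import random
from datetime import datetime

gc.disable()                            # take the garbage collector out of the measurement
start = datetime.now()
with open('file.txt', 'w') as data_file:
    for seq_id in range(0, 1000000):
        data_file.write("%i %f\n" % (seq_id, random.random()))
end = datetime.now()
gc.enable()
print("elapsed time %s" % (end - start))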

answered Oct 24 '22 by Tombart