Python multithreading - writing to file is 10x slower

Question

I've been trying to convert a large file with many lines (27 billion) to JSON. Google Compute recommends that I take advantage of multithreading to improve write times. I've converted my code from this:

import json
import progressbar
f = open('output.txt', 'r')
r = open('json.txt', 'w')
import math
num_permutations = (math.factorial(124)/math.factorial((124-5)))
main_bar = progressbar.ProgressBar(maxval=num_permutations, \
widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage(), progressbar.AdaptiveETA()])
main_bar.start()
m = 0
for lines in f:
        x = lines[:-1].split(' ')
        x = json.dumps(x)
        x += '\n'
        r.write(x)
        m += 1
        main_bar.update(m)

to this:

import json
import progressbar
from Queue import Queue
import threading
q = Queue(maxsize=5)
def worker():
        while True:
                task = q.get()
                r.write(task)
                q.task_done()
for i in range(4):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()
f = open('output.txt', 'r')
r = open('teams.txt', 'w')
import math
num_permutations = (math.factorial(124)/math.factorial((124-5)))
main_bar = progressbar.ProgressBar(maxval=num_permutations, \
widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage(), progressbar.AdaptiveETA()])
main_bar.start()
m = 0
for lines in f:
        x = lines[:-1].split(' ')
        x = json.dumps(x)
        x += '\n'
        q.put(x)
        m += 1
        main_bar.update(m)

I've copied the Queue code pretty much straight from the module documentation.

Before, the whole script would take 2 days. Now it is saying 20 days! I'm not quite sure why, could anyone explain this to me?

EDIT: This could be considered a Python Global Interpreter Lock (GIL) problem. However, I don't think it is: the task is not computationally intensive and is an I/O bottleneck problem. From the threading docs:

If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

My understanding of this is limited, but I believe this to be the latter, i.e. an I/O-bound task. This was my original thought when I wanted to go for multithreading in the first place: that the computation was being blocked by I/O calls that could be put on a separate thread to allow the computation to continue.

FURTHER EDIT: Perhaps the fact is that I've got an IO block from the INPUT, and that is what is slowing it down. Any ideas on how I could effectively send the 'for' loop to a separate thread? Thanks!

jfs · Accepted Answer

If we remove the progressbar code then your code is equivalent to:

#!/usr/bin/env python2
import json
import sys

for line in sys.stdin:        
    json.dump(line.split(), sys.stdout) # split on any whitespace
    print

To improve the time performance, you should measure it first -- take a small input file so that the execution is no more than a minute and run:

$ /usr/bin/time ./your-script < output.txt > json.txt
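As a complement to timing the whole script, the conversion can be timed in memory, which separates the CPU cost from the disk cost. This is only a sketch in Python 3 (the code in this thread is Python 2); `convert_lines`, `time_convert`, and the sample data are hypothetical names and values chosen for illustration.

```python
import io
import json
import time

def convert_lines(src, dst):
    """Write each whitespace-separated input line as one JSON array per line."""
    for line in src:
        dst.write(json.dumps(line.split()) + "\n")

def time_convert(sample_text):
    """Return (elapsed seconds, output text) for an in-memory conversion."""
    src = io.StringIO(sample_text)
    dst = io.StringIO()
    start = time.perf_counter()
    convert_lines(src, dst)
    return time.perf_counter() - start, dst.getvalue()

# Hypothetical sample; in practice use a slice of the real input file.
sample = "a b c d e\n" * 100000
elapsed, out = time_convert(sample)
print("%.3f s for %d lines" % (elapsed, sample.count("\n")))
```

If the in-memory time per line, scaled to the full input, is already close to the total run time, the disk is probably not the bottleneck.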

I don't know why you think writing binary blobs from multiple threads to the same file should be any faster.

What are the candidates for the performance bottleneck here:

- the loop overhead (if lines are small and disks are fast). Put the code inside a function (it may improve performance on CPython due to replacing global lookup with local)
- json.dump() (unlikely but measure it anyway) -- play with parameters such as ensure_ascii=False and measure the results. Try other json modules, different Python implementations.
- disk I/O -- put the result file on a different physical disk (run something like iotop, csysdig to see how the process consumes resources)
- unicode <> bytes conversions, EOL conversions -- open files in binary mode, encode json text to bytes
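Several of the suggestions above (code inside a function, ensure_ascii=False, binary mode) could be combined roughly as follows. This is a Python 3 sketch, not a measured improvement; `convert_file` is a hypothetical name, and it assumes the input is UTF-8.

```python
import json

def convert_file(in_path, out_path):
    """Convert space-separated lines to one JSON array per line."""
    # Inside a function, `dumps` and `write` are fast local lookups.
    dumps = json.dumps
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        write = dst.write
        for line in src:
            # Assumption: the input is UTF-8. Binary mode skips EOL translation,
            # so decoding and encoding are done explicitly here.
            fields = line.decode("utf-8").split()
            write(dumps(fields, ensure_ascii=False).encode("utf-8") + b"\n")
```

Whether any of these changes helps depends on the data and the machine, so time each variant on the same small input.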

John Zwinck · Answer

If you want to speed this up, don't use Python at all--the task is simple enough to handle with Unix filters, for example:

sed 's/ /", "/g; s/^/["/; s/$/"]/' output.txt > json.txt

For an explanation of how this works, see here: https://stackoverflow.com/a/14427404/4323
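To see what the sed command does, and where it differs from json.dumps, here is a small Python 3 sketch. `sed_like` is a hypothetical helper that mimics the three substitutions; it is not part of any library.

```python
import json

def sed_like(line):
    # Mirrors the sed script: replace every space with '", "',
    # then add '["' at the start and '"]' at the end.
    return '["' + line.replace(' ', '", "') + '"]'

plain = 'a b c'
print(sed_like(plain) == json.dumps(plain.split(' ')))  # True

tricky = 'say "hi" now'
print(sed_like(tricky))               # inner quotes are left unescaped
print(json.dumps(tricky.split(' ')))  # inner quotes are escaped as \"
```

So the sed approach yields the same output as the Python script only if the fields never contain double quotes, backslashes, or control characters.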

The problem with using Python for this, if you care about speed, is that you're fundamentally reading one line at a time from the input file. Now, there are fancy ways you could do the reading (divide and conquer), but if you just want to speed it up, the above should do the trick.
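One possible reading of "divide and conquer" is to split the input into byte ranges that start on line boundaries, so that each range can be converted independently. The Python 3 sketch below shows only the splitting and the per-range conversion; `line_aligned_ranges` and `convert_range` are hypothetical names, and UTF-8 input is assumed.

```python
import json
import os

def line_aligned_ranges(path, parts):
    """Split the file into up to `parts` byte ranges, each starting at a line start."""
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            f.seek(size * i // parts)
            f.readline()              # move forward to the next line boundary
            pos = f.tell()
            if starts[-1] < pos < size:
                starts.append(pos)
    return list(zip(starts, starts[1:] + [size]))

def convert_range(in_path, out_path, start, end):
    """Convert the lines in the byte range [start, end) and write them to out_path."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(start)
        while src.tell() < end:
            # Assumption: UTF-8 input, fields separated by whitespace.
            fields = src.readline().decode("utf-8").split()
            dst.write(json.dumps(fields).encode("utf-8") + b"\n")
```

Each range could then be handed to a separate process, each writing its own part file, with the parts concatenated in order at the end. Whether that beats a single sequential pass depends on the disk, so it needs measuring.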

If you want a progress bar, use pv (available as the pv package on many Linux systems), which I think would go this way:

pv output.txt | sed 's/ /", "/g; s/^/["/; s/$/"]/' > json.txt
