I've been trying to convert a large file with many lines (27 billion) to JSON. Google Compute recommends that I take advantage of multithreading to improve write times. I've converted my code from this:
import json
import progressbar
f = open('output.txt', 'r')
r = open('json.txt', 'w')
import math
num_permutations = (math.factorial(124)/math.factorial((124-5)))
main_bar = progressbar.ProgressBar(maxval=num_permutations, \
    widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage(), progressbar.AdaptiveETA()])
main_bar.start()
m = 0
for lines in f:
    x = lines[:-1].split(' ')
    x = json.dumps(x)
    x += '\n'
    r.write(x)
    m += 1
    main_bar.update(m)
to this:
import json
import progressbar
from Queue import Queue
import threading
q = Queue(maxsize=5)
def worker():
    while True:
        task = q.get()
        r.write(task)
        q.task_done()
for i in range(4):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
f = open('output.txt', 'r')
r = open('teams.txt', 'w')
import math
num_permutations = (math.factorial(124)/math.factorial((124-5)))
main_bar = progressbar.ProgressBar(maxval=num_permutations, \
    widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage(), progressbar.AdaptiveETA()])
main_bar.start()
m = 0
for lines in f:
    x = lines[:-1].split(' ')
    x = json.dumps(x)
    x += '\n'
    q.put(x)
    m += 1
    main_bar.update(m)
I've copied the Queue code pretty much straight from the module documentation.
Before, the whole script would take 2 days. Now it is saying 20 days! I'm not quite sure why, could anyone explain this to me?
EDIT: This could be considered a Python Global Interpreter Lock (GIL) problem. However, I don't think it is: the task is not computationally intensive and is bottlenecked on I/O. From the threading docs:
If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
My understanding of this is limited, but I believe this to be the latter, i.e. an I/O-bound task. This was my original reason for going for multithreading in the first place: the computation was being blocked by I/O calls that could be moved to a separate thread, allowing the computation to continue.
FURTHER EDIT: Perhaps the fact is that I've got an IO block from the INPUT, and that is what is slowing it down. Any ideas on how I could effectively send the 'for' loop to a separate thread? Thanks!
If we remove the progressbar code then your code is equivalent to:
#!/usr/bin/env python2
import json
import sys
for line in sys.stdin:
    json.dump(line.split(), sys.stdout) # split on any whitespace
    print
To improve the time performance, you should measure it first -- take a small input file so that the execution is no more than a minute and run:
$ /usr/bin/time ./your-script < output.txt > json.txt
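To see which stage dominates inside the script itself, something like the following rough sketch could help. It is written for Python 3 (the code above is Python 2), the function name time_stages and the sample data are made up for illustration, and it writes to an in-memory buffer, so the "write" figure only reflects Python-level overhead and not your disk.

```python
import io
import json
import time


def time_stages(lines):
    """Return (timings, output): seconds spent in split, json.dumps and write."""
    out = io.StringIO()  # in-memory stand-in for the real output file
    timings = {"split": 0.0, "dumps": 0.0, "write": 0.0}
    for line in lines:
        t0 = time.perf_counter()
        fields = line.split()
        t1 = time.perf_counter()
        encoded = json.dumps(fields)
        t2 = time.perf_counter()
        out.write(encoded + "\n")
        t3 = time.perf_counter()
        timings["split"] += t1 - t0
        timings["dumps"] += t2 - t1
        timings["write"] += t3 - t2
    return timings, out.getvalue()


if __name__ == "__main__":
    sample = ["a b c d e\n"] * 100000  # hypothetical sample input
    timings, _ = time_stages(sample)
    for stage, seconds in sorted(timings.items()):
        print("%-6s %.3f s" % (stage, seconds))
```

The per-line timer calls add overhead of their own, so treat the numbers as relative, not absolute.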
I don't know why you think writing binary blobs from multiple threads to the same file should be any faster.
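One way to check this for yourself is to time the plain loop against a queue-and-thread version on the same in-memory sample. The sketch below is Python 3, the function names are made up, and it uses a single writer thread with a None sentinel (unlike the four daemon threads in the question) so that line order is preserved and the thread can be joined. Every q.put() and q.get() takes a lock and may hand control to another thread, and in CPython only one thread runs Python bytecode at a time, so I would expect the queue version to be slower per line, but measure it.

```python
import io
import json
import queue
import threading
import time


def convert_direct(lines, out):
    """Encode and write each line in the calling thread."""
    for line in lines:
        out.write(json.dumps(line.split()) + "\n")


def convert_with_queue(lines, out):
    """Encode in the calling thread, hand each line to one writer thread."""
    q = queue.Queue(maxsize=5)

    def writer():
        while True:
            item = q.get()
            if item is None:  # sentinel: no more work
                break
            out.write(item)

    t = threading.Thread(target=writer)
    t.start()
    for line in lines:
        q.put(json.dumps(line.split()) + "\n")
    q.put(None)
    t.join()


if __name__ == "__main__":
    sample = ["a b c d e\n"] * 100000  # hypothetical sample input
    for fn in (convert_direct, convert_with_queue):
        out = io.StringIO()
        start = time.perf_counter()
        fn(sample, out)
        print(fn.__name__, "%.2f s" % (time.perf_counter() - start))
```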
What are the candidates for the performance bottleneck here:
json.dump() (unlikely but measure it anyway) -- play with parameters such as ensure_ascii=False and measure the results. Try other json modules, different Python implementations.
[...] iotop, csysdig to see how the process consumes resources)
If you want to speed this up, don't use Python at all--the task is simple enough to handle with Unix filters, for example:
sed 's/ /", "/g; s/^/["/; s/$/"]/' output.txt > json.txt
For an explanation of how this works, see here: https://stackoverflow.com/a/14427404/4323
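To make the three substitutions concrete, here is a small Python 3 sketch that applies the same steps to one line and compares the result with json.dumps. The function name sed_style is made up for illustration. One difference worth checking against your data: the sed command does not escape anything, so a field containing a double quote or backslash would produce invalid JSON, whereas json.dumps escapes it.

```python
import json


def sed_style(line):
    """Mimic the sed command on one line (without its trailing newline)."""
    fields = line.replace(" ", '", "')  # s/ /", "/g
    return '["' + fields + '"]'         # s/^/["/ and s/$/"]/


if __name__ == "__main__":
    plain = "a b c"
    print(sed_style(plain))
    print(json.dumps(plain.split(" ")))  # same text for plain fields

    tricky = 'x"y z'
    print(sed_style(tricky))             # quote is left unescaped
    print(json.dumps(tricky.split(" "))) # quote is escaped
```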
The problem with using Python for this, if you care about speed, is that you're fundamentally reading one line at a time from the input file. Now, there are fancy ways you could do the reading (divide and conquer), but if you just want to speed it up, the above should do the trick.
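For what the divide-and-conquer route might look like, here is a rough Python 3 sketch: split the input into byte ranges that begin and end on line boundaries, convert each range in its own process, then concatenate the parts in order. All function names and the ".partN" file naming are made up, it assumes UTF-8 input with "\n" line endings, and whether it beats the sed pipeline depends on whether your disk or your CPU is the limit, so measure before relying on it.

```python
import json
import os
import shutil
from multiprocessing import Pool


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()  # advance to the start of the next full line
            pos = f.tell()
            if pos >= size:
                break
            if pos > offsets[-1]:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))


def convert_chunk(args):
    """Convert the lines in one byte range and write them to a part file."""
    path, start, end, part_path = args
    with open(path, "rb") as src, open(part_path, "w") as dst:
        src.seek(start)
        pos = start
        while pos < end:
            line = src.readline()
            if not line:
                break
            pos += len(line)
            fields = line.decode("utf-8").rstrip("\n").split(" ")
            dst.write(json.dumps(fields) + "\n")
    return part_path


def convert_parallel(path, out_path, workers=4):
    """Convert the whole file using a pool of worker processes."""
    ranges = chunk_offsets(path, workers)
    jobs = [(path, s, e, "%s.part%d" % (out_path, i))
            for i, (s, e) in enumerate(ranges)]
    with Pool(workers) as pool:
        parts = pool.map(convert_chunk, jobs)  # results come back in job order
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as p:
                shutil.copyfileobj(p, out)
            os.remove(part)
```

You would call convert_parallel('output.txt', 'json.txt') from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn new processes instead of forking.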
If you want a progress bar, use pv (available as the pv package on most Linux distributions), which I think would go this way:
pv output.txt | sed 's/ /", "/g; s/^/["/; s/$/"]/' > json.txt