I asked a related but very general question earlier (see especially this response).
This question is very specific. This is all the code I care about:
result = {}
for line in open('input.txt'):
    key, value = parse(line)
    result[key] = value
The function parse
is completely self-contained (i.e., it doesn't use any shared resources).
I have an Intel i7-920 CPU (4 cores, 8 threads; I think the threads are more relevant, but I'm not sure).
What can I do to make my program use all the parallel capabilities of this CPU?
I assume I can open this file for reading in 8 different threads without much performance penalty since disk access time is small relative to the total time.
This can be done using Ray, which is a library for writing parallel and distributed Python.
To run the code below, first create input.txt
as follows.
printf "1\n2\n3\n4\n5\n6\n" > input.txt
Then you can process the file in parallel by adding the @ray.remote
decorator to the parse
function and executing many copies of it concurrently, as follows:
import ray
import time

ray.init()

@ray.remote
def parse(line):
    time.sleep(1)  # simulate an expensive parse
    return 'key' + line.strip(), 'value'

# Submit all of the "parse" tasks in parallel and wait for the results.
keys_and_values = ray.get([parse.remote(line) for line in open('input.txt')])
# Create a dictionary out of the results.
result = dict(keys_and_values)
Note that the optimal way to do this will depend on how long it takes to run the parse
function. If it takes one second (as above), then parsing one line per Ray task makes sense. If it takes 1 millisecond, then it probably makes sense to parse a bunch of lines (e.g., 100) per Ray task.
Your script is simple enough that the multiprocessing module can also be used; however, as soon as you want to do anything more complicated, or want to leverage multiple machines instead of just one, it will be much easier with Ray.
See the Ray documentation.
CPython does not easily provide the threading model you are looking for: the global interpreter lock (GIL) prevents threads from executing Python bytecode in parallel. You can get something similar using the multiprocessing
module and a process pool.
Such a solution could look something like this:
import multiprocessing

def worker(lines):
    """Make a dict out of the parsed, supplied lines."""
    result = {}
    for line in lines:
        # 'parse' is the self-contained function from the question.
        k, v = parse(line)
        result[k] = v
    return result

if __name__ == '__main__':
    # Configurable options. Different values may work better.
    numprocesses = 8
    numlines = 100
    lines = open('input.txt').readlines()
    # Create the process pool.
    with multiprocessing.Pool(processes=numprocesses) as pool:
        # Map chunks of `numlines` lines to a list of result dicts.
        result_list = pool.map(worker,
            (lines[i:i + numlines] for i in range(0, len(lines), numlines)))
    # Reduce the result dicts into a single dict.
    result = {}
    for partial in result_list:
        result.update(partial)
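On Python 3, the same split/parse/merge pattern can also be expressed with the standard library's concurrent.futures module, which handles the pool and the chunking for you. A minimal sketch; the placeholder parse body and the chunksize of 100 are assumptions, not from the original question:
import concurrent.futures

def parse(line):
    # Hypothetical stand-in for the question's self-contained parse function.
    key, _, value = line.strip().partition(' ')
    return key, value

if __name__ == '__main__':
    with open('input.txt') as f:
        lines = f.readlines()
    # ProcessPoolExecutor.map runs parse across worker processes;
    # chunksize batches lines per task to amortize IPC overhead.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        result = dict(executor.map(parse, lines, chunksize=100))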