I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.
First, make some fake data:
paste <(seq 20000000) <(seq 2 20000001) > largefile.txt
Now here is a minimal piece of Python code to read it in and make a dictionary:
#!/usr/bin/env python
import sys
from collections import defaultdict

# map each key (first field) to the list of values seen for it
d = defaultdict(list)
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split()
        d[parts[0]].append(parts[1])
Timings:
time ./read.py largefile.txt
real 0m55.746s
However, it is possible to read the whole file much faster:
time cut -f1 largefile.txt > /dev/null
real 0m1.702s
My CPU has 8 cores; is it possible to parallelize this program in Python to speed it up?
One possibility might be to read in large chunks of the input, run 8 processes in parallel on different non-overlapping subchunks to build dictionaries from the data in memory, then read in another large chunk, and so on. Is this possible in Python using multiprocessing somehow?
Update: the fake data was not very good, as it had only one value per key. Better is:
perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
(Related to Read in large file and make dictionary.)
It may be possible to parallelize this to speed it up, but doing multiple reads in parallel is unlikely to help.
Your OS is unlikely to usefully do multiple reads in parallel (the exception is with something like a striped raid array, in which case you still need to know the stride to make optimal use of it).
What you can do, is run the relatively expensive string/dictionary/list operations in parallel to the read.
So, one thread reads and pushes (large) chunks onto a synchronized queue, while one or more consumer threads pull chunks from the queue, split them into lines, and populate the dictionary.
(If you go for multiple consumer threads, as Pappnese says, build one dictionary per thread and then join them).
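As a minimal sketch of that producer/consumer layout (Python 3; the chunk size, consumer count, and queue bound below are illustrative guesses, and CPython's GIL will limit how much the consumers actually overlap), it could look roughly like this:
import sys
import threading
import queue
from collections import defaultdict

CHUNK_SIZE = 1 << 20     # illustrative size hint: roughly 1 MB of lines per chunk
N_CONSUMERS = 2          # illustrative consumer count
SENTINEL = None          # marks end of input for the consumers

def reader(path, q):
    # Read large chunks of whole lines and push them onto the queue.
    with open(path) as fin:
        while True:
            lines = fin.readlines(CHUNK_SIZE)   # stops on a line boundary near the size hint
            if not lines:
                break
            q.put(lines)
    for _ in range(N_CONSUMERS):
        q.put(SENTINEL)

def consumer(q, results):
    # Build a private dictionary from the chunks pulled off the queue.
    d = defaultdict(list)
    while True:
        lines = q.get()
        if lines is SENTINEL:
            break
        for line in lines:
            parts = line.split()
            d[parts[0]].append(parts[1])
    results.append(d)

if __name__ == '__main__':
    q = queue.Queue(maxsize=8)   # bounded, so the reader cannot run too far ahead
    results = []
    consumers = [threading.Thread(target=consumer, args=(q, results))
                 for _ in range(N_CONSUMERS)]
    for t in consumers:
        t.start()
    reader(sys.argv[1], q)
    for t in consumers:
        t.join()
    # join the per-thread dictionaries, as suggested above
    merged = defaultdict(list)
    for d in results:
        for key, values in d.items():
            merged[key].extend(values)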
Re. bounty:
C obviously doesn't have the GIL to contend with, so multiple consumers are likely to scale better. The read behaviour doesn't change though. The down side is that C lacks built-in support for hash maps (assuming you still want a Python-style dictionary) and synchronized queues, so you have to either find suitable components or write your own. The basic strategy of multiple consumers each building their own dictionary and then merging them at the end is still likely the best.
Using strtok_r instead of str.split may be faster, but remember you'll need to manage the memory for all your strings manually too. Oh, and you need logic to manage line fragments too. Honestly, C gives you so many options that I think you'll just need to profile it and see.
It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.
Because the OP mentioned that the lists on each input line would be longer in practice than two elements, I made a slightly more realistic input file using:
paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
head -1000000 > largefile.txt
After profiling the original code, I found the slowest portion of the process to be the line-splitting routine. (.split() took approximately 2x longer than .append() on my machine.)
1000000 0.333 0.000 0.333 0.000 {method 'split' of 'str' objects}
1000000 0.154 0.000 0.154 0.000 {method 'append' of 'list' objects}
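For reference, a profile like the one above can be reproduced with Python's built-in cProfile module; assuming the script is saved as process.py (as in the timings below), something along the lines of:
python -m cProfile -s tottime process.py smallfile.txt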
So I factored the split out into another function and used a pool to distribute the work of splitting the fields:
import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

def split(l):
    return l.split()

pool = mp.Pool(processes=4)
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])
Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:
$ time python process.py smallfile.txt
real 0m7.170s
user 0m6.884s
sys 0m0.260s
versus the parallel version:
$ time python process-mp.py smallfile.txt
real 0m16.655s
user 0m24.688s
sys 0m1.380s
Because the .map() call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.
To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
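A rough, untested sketch of that block-oriented approach (the block size, worker count, and the idea of having each worker return a partial dictionary to merge are my assumptions; readlines() with a size hint always stops on a line boundary, which sidesteps the uneven-line-length concern, and unpickling the returned dictionaries may still be costly):
import sys
import multiprocessing as mp
from collections import defaultdict

def build_dict(lines):
    # Worker: turn one block of lines into a partial dictionary.
    d = defaultdict(list)
    for line in lines:
        parts = line.split()
        d[parts[0]].append(parts[1:])
    return d

def blocks(path, size_hint=1 << 22):
    # Yield lists of whole lines, roughly size_hint bytes at a time.
    with open(path) as fin:
        while True:
            lines = fin.readlines(size_hint)
            if not lines:
                return
            yield lines

if __name__ == '__main__':
    merged = defaultdict(list)
    with mp.Pool(processes=4) as pool:
        for partial in pool.imap_unordered(build_dict, blocks(sys.argv[1])):
            # merging the returned partial dictionaries still costs unpickling time
            for key, values in partial.items():
                merged[key].extend(values)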
There was a blog post series, the "Wide Finder Project", about this several years ago at Tim Bray's site [1]. There you can find a solution [2] by Fredrik Lundh of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code would.
[1] http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/
One thing you could try is to get the line count from the file, then spawn 8 threads, each of which makes a dictionary from 1/8th of the file, and join the dictionaries when all threads are finished. This will probably speed things up if it is the appending that takes the time, not the reading of the lines.
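A rough sketch of that idea (Python 3; the helper names count_lines and build_part are made up, and because of CPython's GIL the threads will not truly run the appends in parallel, so a process-based variant may be needed to see a real win):
import sys
import threading
from collections import defaultdict
from itertools import islice

def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

def build_part(path, start, stop, results, idx):
    # Build a dictionary from lines [start, stop) of the file.
    # islice still scans from the top of the file to reach `start`,
    # so this only pays off if appending dominates the cost.
    d = defaultdict(list)
    with open(path) as f:
        for line in islice(f, start, stop):
            parts = line.split()
            d[parts[0]].append(parts[1])
    results[idx] = d

if __name__ == '__main__':
    path = sys.argv[1]
    n_threads = 8
    total = count_lines(path)
    step = (total + n_threads - 1) // n_threads
    results = [None] * n_threads
    threads = [threading.Thread(target=build_part,
                                args=(path, i * step, min((i + 1) * step, total), results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # join the per-thread dictionaries
    merged = defaultdict(list)
    for d in results:
        for key, values in d.items():
            merged[key].extend(values)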