
Read large file in parallel?

I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.

First, make some fake data:

paste <(seq 20000000) <(seq 2 20000001)  > largefile.txt

Now here is a minimal piece of Python code to read it in and make a dictionary:

import sys
from collections import defaultdict
fin = open(sys.argv[1])

# Collect every value seen for each key.
d = defaultdict(list)

for line in fin:
    parts = line.split()
    d[parts[0]].append(parts[1])

Timings:

time ./read.py largefile.txt
real    0m55.746s

However, it is possible to read the whole file much faster:

time cut -f1 largefile.txt > /dev/null    
real    0m1.702s

My CPU has 8 cores. Is it possible to parallelize this program in Python to speed it up?

One possibility might be to read in large chunks of the input, run 8 processes in parallel on non-overlapping subchunks to build dictionaries from the data in memory, and then read in another large chunk. Is this possible in Python using multiprocessing somehow?

Update: the fake data was not very good, as it had only one value per key. Better is:

perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt

(Related to Read in large file and make dictionary.)

asked Aug 07 '13 by graffe

4 Answers

It may be possible to parallelize this to speed it up, but doing multiple reads in parallel is unlikely to help.

Your OS is unlikely to usefully do multiple reads in parallel (the exception is something like a striped RAID array, in which case you still need to know the stride to make optimal use of it).

What you can do is run the relatively expensive string/dictionary/list operations in parallel with the read.

So one thread reads and pushes (large) chunks onto a synchronized queue, and one or more consumer threads pull chunks from the queue, split them into lines, and populate the dictionary.

(If you go for multiple consumer threads, as Pappnese says, build one dictionary per thread and then join them).


Hints:

  • ... push chunks to a synchronized queue ...
  • ... one or more consumer threads ...
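Here is a minimal sketch of that reader/consumer structure in Python, assuming the two-column lines from the question; the chunk size, queue bound, and consumer count are arbitrary illustration values, not anything from the answer. Keep in mind that with pure-Python consumers, CPython's GIL prevents the splitting from running truly in parallel, so this mainly shows the shape of the solution; the same structure works with multiprocessing at the cost of pickling each chunk.

import sys
import threading
import queue
from collections import defaultdict
from itertools import islice

CHUNK_LINES = 100000        # lines per chunk pushed onto the queue (assumed)
NUM_CONSUMERS = 4           # number of consumer threads (assumed)
SENTINEL = None             # marks the end of the stream

def reader(path, q):
    # Producer: read the file in large chunks of lines and enqueue them.
    with open(path) as f:
        while True:
            chunk = list(islice(f, CHUNK_LINES))
            if not chunk:
                break
            q.put(chunk)
    for _ in range(NUM_CONSUMERS):
        q.put(SENTINEL)     # tell each consumer to stop

def consumer(q, results):
    # Consumer: split lines from each chunk into its own private dictionary.
    d = defaultdict(list)
    while True:
        chunk = q.get()
        if chunk is SENTINEL:
            break
        for line in chunk:
            parts = line.split()
            d[parts[0]].append(parts[1])
    results.append(d)

if __name__ == "__main__":
    q = queue.Queue(maxsize=8)   # bounded, so the reader can't get too far ahead
    results = []
    threads = [threading.Thread(target=consumer, args=(q, results))
               for _ in range(NUM_CONSUMERS)]
    for t in threads:
        t.start()
    reader(sys.argv[1], q)
    for t in threads:
        t.join()

    # Join the per-consumer dictionaries into one, as described above.
    merged = defaultdict(list)
    for d in results:
        for key, values in d.items():
            merged[key].extend(values)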

Re. bounty:

C obviously doesn't have the GIL to contend with, so multiple consumers are likely to scale better. The read behaviour doesn't change, though. The downside is that C lacks built-in support for hash maps (assuming you still want a Python-style dictionary) and synchronized queues, so you have to either find suitable components or write your own. The basic strategy of multiple consumers each building their own dictionary and then merging them at the end is still likely the best.

Using strtok_r instead of str.split may be faster, but remember you'll also need to manage the memory for all your strings manually, and you'll need logic to handle line fragments. Honestly, C gives you so many options that I think you'll just need to profile it and see.

answered by Useless

It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.

Because the OP mentioned that the lists on each input line would in practice be longer than two elements, I made a slightly more realistic input file using:

paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
  head -1000000 > largefile.txt

After profiling the original code, I found the slowest portion of the process to be the line-splitting routine. (.split() took roughly twice as long as .append() on my machine.)

1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}

So I factored the split out into another function and used a pool to distribute the work of splitting the fields:

import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

def split(l):
    return l.split()

# Farm the line-splitting out to worker processes; the parent process
# still builds the dictionary from the fields each call returns.
pool = mp.Pool(processes=4)
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])

Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:

$ time python process.py smallfile.txt 
real    0m7.170s
user    0m6.884s
sys     0m0.260s

versus the parallel version:

$ time python process-mp.py smallfile.txt 
real    0m16.655s
user    0m24.688s
sys     0m1.380s

Because the .map() call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.

To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
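Below is a rough sketch of that fixed-block idea, again assuming the two-column input from the question; the block size, worker count, and helper names (read_blocks, parse_block) are made up for illustration. The reader cuts blocks at line boundaries by carrying the trailing partial line over to the next block, and each worker returns a plain list of key/value pairs for the parent to merge, so variable line lengths are not a problem.

import sys
import multiprocessing as mp
from collections import defaultdict

BLOCK_SIZE = 16 * 1024 * 1024   # bytes per block sent to a worker (assumed)

def read_blocks(path):
    # Yield blocks of whole lines, roughly BLOCK_SIZE bytes each. A trailing
    # partial line is carried over so no line straddles two blocks.
    leftover = b""
    with open(path, "rb") as f:
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            data = leftover + data
            cut = data.rfind(b"\n")
            if cut == -1:
                leftover = data          # no complete line yet; keep reading
                continue
            leftover = data[cut + 1:]
            yield data[:cut + 1]
    if leftover:
        yield leftover

def parse_block(block):
    # Worker: split one block into (key, value) pairs. The keys and values
    # are bytes here; decode them if str keys are needed.
    pairs = []
    for line in block.splitlines():
        parts = line.split()
        pairs.append((parts[0], parts[1]))
    return pairs

if __name__ == "__main__":
    d = defaultdict(list)
    with mp.Pool(processes=4) as pool:
        for pairs in pool.imap(parse_block, read_blocks(sys.argv[1])):
            for key, value in pairs:
                d[key].append(value)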

answered by lmjohns3


There was a blog post series about this, the "Wide Finder Project", at Tim Bray's site several years ago [1]. There you can find a solution [2] by Fredrik Lundh, of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code.

[1] http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/

answered by Ecir Hana


One thing you could try is to get the line count from the file, then spawn 8 threads, each of which builds a dictionary from 1/8th of the file, and finally join the dictionaries when all threads are finished. This will probably speed things up if it is the appending that takes the time rather than the reading of the lines.
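A sketch of that idea, using a multiprocessing pool rather than threads so the splitting can actually use separate cores (pure-Python threads would be serialized by the GIL); the worker count and the helper build_partial are assumptions for illustration. Note that islice still has to scan past the skipped lines in each worker, so this mainly parallelizes the splitting and appending rather than the raw I/O.

import sys
from collections import defaultdict
from itertools import islice
from multiprocessing import Pool

WORKERS = 8                     # one worker per core (assumed)

def build_partial(args):
    # Each worker builds a dictionary from its own range of line numbers.
    filename, start, stop = args
    d = defaultdict(list)
    with open(filename) as f:
        for line in islice(f, start, stop):
            parts = line.split()
            d[parts[0]].append(parts[1])
    return d

if __name__ == "__main__":
    filename = sys.argv[1]
    with open(filename) as f:
        total = sum(1 for _ in f)            # get the line count first
    step = (total + WORKERS - 1) // WORKERS
    jobs = [(filename, i, min(i + step, total)) for i in range(0, total, step)]

    # Join the per-worker dictionaries when all workers are finished.
    merged = defaultdict(list)
    with Pool(WORKERS) as pool:
        for partial in pool.map(build_partial, jobs):
            for key, values in partial.items():
                merged[key].extend(values)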

answered by oyhovd