
Read in large file and make dictionary

I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.

First make some fake data

paste <(seq 20000000) <(seq 2 20000001)  > largefile.txt

Now here is a minimal piece of Python code to read it in and make a dictionary.

import sys
from collections import defaultdict
fin = open(sys.argv[1])

dict = defaultdict(list)

for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])

Timings:

time ./read.py largefile.txt
real    0m55.746s

However, it is not I/O bound, as this shows:

time cut -f1 largefile.txt > /dev/null    
real    0m1.702s

If I comment out the dict line, it takes 9 seconds. It seems that almost all the time is spent in dict[parts[0]].append(parts[1]).

Is there any way to speed this up? I don't mind using Cython or even C if that is going to make a big difference. Or can pandas help here?

Here is the profile output on a file of 10,000,000 lines.

python -m cProfile read.py test.data

         20000009 function calls in 42.494 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 bisect.py:1(<module>)
        1    0.000    0.000    0.001    0.001 collections.py:1(<module>)
        1    0.000    0.000    0.000    0.000 collections.py:25(OrderedDict)
        1    0.000    0.000    0.000    0.000 collections.py:386(Counter)
        1    0.000    0.000    0.000    0.000 heapq.py:31(<module>)
        1    0.000    0.000    0.000    0.000 keyword.py:11(<module>)
        1   30.727   30.727   42.494   42.494 read.py:2(<module>)
 10000000    4.855    0.000    4.855    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 10000000    6.912    0.000    6.912    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}

Update: we can assume that parts[1] is an integer and that parts[0] is a short, fixed-length string.

My fake data isn't very good, as it only gives one value per key. Here is a better version.

perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
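
If you don't have perl handy, a rough Python equivalent for generating the same kind of fake data might look like this (a sketch, assuming the same ranges and line count as the one-liner above; the filename is just an example):

import random

# Write ~1e7 "key value" lines with keys in [0, 1e7) and values in [0, 1e4),
# roughly matching the perl one-liner (use xrange on Python 2 if memory matters).
with open('largefile.txt', 'w') as fout:
    for _ in range(10 ** 7):
        fout.write('%d %d\n' % (random.randrange(10 ** 7), random.randrange(10 ** 4)))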

The only operation I will do is to query a key to return the list of values associated with it.

asked Aug 06 '13 by graffe


2 Answers

If you want what you described in the comment, you can do it easily in pandas. Let's say you have a file with the same layout, but with duplicated keys, since in your example you append all the duplicates to a list:

1 1
2 2
1 3
3 4
1 5
5 6

Then you can read and manipulate the data:

In [1]: df = pd.read_table('largefile.txt', header=None, index_col=0)
In [2]: df.loc[2]
Out[2]:
1    2
Name: 2, dtype: int64

In [3]: df.loc[1]
Out[3]:
   1
0
1  1
1  3
1  5

Pandas stores everything in indexed DataFrame and Series objects, so don't worry too much about the shape of the output: the first column is the index, and the second column is the important one; it gives you the numbers you need.
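
If you want a plain Python list back out of such a lookup, something like the helper below should work (a sketch; the function name is just for illustration; note that .loc returns a DataFrame for a duplicated key but a Series for a unique one, which is why the ravel()):

def values_for(df, key):
    # .values is 2-D for a duplicated key and 1-D for a unique one;
    # ravel() flattens both to a simple list of values.
    return df.loc[key].values.ravel().tolist()

values_for(df, 1)   # [1, 3, 5] for the sample data above
values_for(df, 2)   # [2]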

You can do a lot more with pandas, though. For example, you can group by the first column in your file and perform aggregations:

In [64]: df = pd.read_table('largefile.txt', header=None).groupby(0)
In [65]: df.sum()
Out[65]:
   1
0
1  9
2  2
3  4
5  6
In [66]: df.mean()
Out[66]:
   1
0
1  3
2  2
3  4
5  6    
In [67]: df[0].count()
Out[67]:
0
1    3
2    1
3    1
5    1
dtype: int64

I know that this is not the answer to how to speed up the dictionary thing, but given what you mentioned in the comment, this could be an alternative solution.

Edit - Added timing

Comparing the fastest dictionary solution with loading the data into a pandas DataFrame:

test_dict.py

import sys
d = {}
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split(None, 1)
        d[parts[0]] = d.get(parts[0], []) + [parts[1]]

test_pandas.py

import sys
import pandas as pd
df = pd.read_table(sys.argv[1], header=None, index_col=0)

Timed on a linux machine:

$ time python test_dict.py largefile.txt
real    1m13.794s
user    1m10.148s
sys     0m3.075s

$ time python test_pandas.py largefile.txt
real    0m10.937s
user    0m9.819s
sys     0m0.504s

Edit: for the new example file

In [1]: import pandas as pd
In [2]: df = pd.read_table('largefile.txt', header=None,
                           sep=' ', index_col=0).sort_index()
In [3]: df.index
Out[3]: Int64Index([0, 1, 1, ..., 9999998, 9999999, 9999999], dtype=int64)
In [4]: df[1][0]
Out[4]: 6301
In [5]: df[1][1].values
Out[5]: array([8936, 5983])
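
If what you ultimately need is the same key -> list-of-values dictionary as in your original code, one way to build it straight from pandas is a groupby (a sketch; whether this beats the pure-dict loop is something to time on your own data):

import sys
import pandas as pd

# Read the two space-separated columns, then collect the values for
# each key into a list and turn the result into a plain dict.
df = pd.read_table(sys.argv[1], header=None, sep=' ')
d = df.groupby(0)[1].apply(list).to_dict()

d[1]  # all values that appeared with key 1
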
answered by Viktor Kerkez


Here are a few quick performance improvements I managed to get:

Using a plain dict instead of defaultdict, and changing d[parts[0]].append(parts[1]) to d[parts[0]] = d.get(parts[0], []) + [parts[1]], cut the time by 10%. I don't know whether it's eliminating all those calls to a Python __missing__ function, not mutating the lists in-place, or something else that deserves the credit.

Just using setdefault on a plain dict instead of defaultdict also cuts the time by 8%, which implies that it's the extra dict work rather than the in-place appends.

Meanwhile, replacing the split() with split(None, 1) helps by 9%.

Running in PyPy 1.9.0 instead of CPython 2.7.2 cut the time by 52%; PyPy 2.0b by 55%.

If you can't use PyPy, CPython 3.3.0 cut the time by 9%.

Running in 32-bit mode instead of 64-bit increased the time by 170%, which implies that if you're using 32-bit you may want to switch.


The fact that the dict takes over 2GB to store (slightly less in 32-bit) is probably a big part of the problem. The only real alternative is to store everything on disk. (In a real-life app you'd probably want to manage an in-memory cache, but here, you're just generating the data and quitting, which makes things simpler.) Whether this helps depends on a number of factors. I suspect that on a system with an SSD and not much RAM it'll speed things up, while on a system with a 5400rpm hard drive and 16GB of RAM (like the laptop I'm using at the moment) it won't… But depending on your system's disk cache, etc., who knows, without testing.

There's no quick&dirty way to store lists of strings in disk-based storage (shelve will presumably waste more time with the pickling and unpickling than it saves), but changing it to just concatenate strings instead and using gdbm kept the memory usage down below 200MB and finished in about the same time, and has the nice side effect that if you want to use the data more than once, you've got them stored persistently. Unfortunately, plain old dbm wouldn't work because the default page size is too small for this many entries, and the Python interface doesn't provide any way to override the default.

Switching to a simple sqlite3 database that just has non-unique Key and Value columns and doing it in :memory: took about 80% longer, while on disk it took 85% longer. I suspect that denormalizing things to store multiple values with each key wouldn't help, and would in fact make things worse. (Still, for many real life uses, this may be a better solution.)
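
For reference, here's a minimal sketch of that sqlite3 variant (the table and column names are just examples; swap ':memory:' for a filename to put it on disk):

import sqlite3
import sys

# Load the file into a table with non-unique key/value columns.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT, value TEXT)')
with open(sys.argv[1]) as fin:
    conn.executemany('INSERT INTO kv VALUES (?, ?)',
                     (line.split() for line in fin))
conn.commit()

# Querying a key returns all of its values.
values = [v for (v,) in conn.execute('SELECT value FROM kv WHERE key = ?', ('42',))]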


Meanwhile, wrapping cProfile around your main loop:

         40000002 function calls in 31.222 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   21.770   21.770   31.222   31.222 <string>:2(<module>)
 20000000    2.373    0.000    2.373    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 20000000    7.079    0.000    7.079    0.000 {method 'split' of 'str' objects}

So, roughly a quarter of your time is spent in str.split, under 10% in append, and the rest goes to code that cProfile couldn't see into, which here includes both iterating the file and the defaultdict method calls.

Switching to a regular dict with setdefault (which, remember, was a little faster) shows 3.774 seconds spent in setdefault, so that's about 15% of the time, or presumably around 20% for the defaultdict version. Presumably the __setitem__ method isn't going to be any worse than setdefault or defaultdict.__getitem__ were.

However, we may not be seeing the time charged by malloc calls here, and they may be a huge chunk of the performance. To test that, you'll need a C-level profiler. So let's come back to that.

Meanwhile, at least some of the leftover time is probably taken up by the line-splitting as well, since that must cost on the same order as space-splitting, right? But I don't know of any way to improve that significantly.


Finally, a C-level profiler is going to help here, but one run on my system may not help much for your system, so I'll leave that to you.


The fastest version on my system depends on which Python I run, but it's either this:

d = {}    
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], []) + [parts[1]]

Or this:

d = {}    
for line in fin:
    parts = line.split(None, 1)
    d.setdefault(parts[0], []).append(parts[1])

… And they're both pretty close to each other.

The gdbm solution, which was about the same speed, and has obvious advantages and disadvantages, looks like this:

d = gdbm.open(sys.argv[1] + '.db', 'c')
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], '') + ',' + parts[1]

(Obviously if you want to be able to run this repeatedly, you will need to add a line to delete any pre-existing database—or, better, if it fits your use case, to check its timestamp against the input file's and skip the whole loop if it's already up-to-date.)
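
And since the gdbm file sticks around, getting a key's values back out later is just a matter of splitting the stored string again (a sketch; '42' is an example key, and the strip/filter deals with the leading ',' and the trailing newlines left by the concatenation above):

import sys
import gdbm   # dbm.gnu on Python 3

# Reopen the database built above read-only and split one entry back into a list.
d = gdbm.open(sys.argv[1] + '.db', 'r')
values = [v.strip() for v in d['42'].split(',') if v.strip()]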

answered by abarnert