I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.
First, make some fake data:
paste <(seq 20000000) <(seq 2 20000001) > largefile.txt
Now here is a minimal piece of Python code to read it in and make a dictionary:
import sys
from collections import defaultdict
fin = open(sys.argv[1])
dict = defaultdict(list)
for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])
Timings:
time ./read.py largefile.txt
real 0m55.746s
However, it is not I/O bound, as shown by:
time cut -f1 largefile.txt > /dev/null
real 0m1.702s
If I comment out the dict line it takes 9 seconds. It seems that almost all the time is spent by dict[parts[0]].append(parts[1]).
Is there any way to speed this up? I don't mind using cython or even C if that is going to make a big difference. Or can pandas help here?
Here is the profile output on a file of size 10000000 lines.
python -m cProfile read.py test.data
20000009 function calls in 42.494 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 bisect.py:1(<module>)
1 0.000 0.000 0.001 0.001 collections.py:1(<module>)
1 0.000 0.000 0.000 0.000 collections.py:25(OrderedDict)
1 0.000 0.000 0.000 0.000 collections.py:386(Counter)
1 0.000 0.000 0.000 0.000 heapq.py:31(<module>)
1 0.000 0.000 0.000 0.000 keyword.py:11(<module>)
1 30.727 30.727 42.494 42.494 read.py:2(<module>)
10000000 4.855 0.000 4.855 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
10000000 6.912 0.000 6.912 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 {open}
Update. We can assume that parts[1] is an integer and that parts[0] is a short fixed length string.
My fake data isn't very good as you only get one value per key. Here is a better version.
perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
The only operation I will do is to query a key to return the list of values associated with it.
If you want what you described in the comment, then you can do it easily in pandas. Let's say you have a file with the same layout but with duplicated keys, since in your example you add all the duplicates into a list:
1 1
2 2
1 3
3 4
1 5
5 6
Then you can read and manipulate the data (with pandas imported as pd):
In [1]: df = pd.read_table('largefile.txt', header=None, index_col=0)
In [2]: df.loc[2]
Out[2]:
1 2
Name: 2, dtype: int64
In [3]: df.loc[1]
Out[3]:
1
0
1 1
1 3
1 5
Pandas stores everything in DataFrames and Series objects, which are indexed, so don't worry too much about the output: the first column is the index and the second column is the important one, and it will give you the numbers you need.
You can do a lot more with pandas, though. For example, you can group by the first column in your file and perform aggregations:
In [64]: df = pd.read_table('largefile.txt', header=None).groupby(0)
In [65]: df.sum()
Out[65]:
1
0
1 9
2 2
3 4
5 6
In [66]: df.mean()
Out[66]:
1
0
1 3
2 2
3 4
5 6
In [67]: df[0].count()
Out[67]:
0
1 3
2 1
3 1
5 1
dtype: int64
I know that this is not the answer to how to speed up the dictionary thing, but from what you mentioned in the comment, this could be an alternate solution.
Edit - Added timing
Compared to the fastest dictionary solution and loading data into pandas DataFrame:
test_dict.py
import sys
d = {}
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split(None, 1)
        d[parts[0]] = d.get(parts[0], []) + [parts[1]]
test_pandas.py
import sys
import pandas as pd
df = pd.read_table(sys.argv[1], header=None, index_col=0)
Timed on a Linux machine:
$ time python test_dict.py largefile.txt
real 1m13.794s
user 1m10.148s
sys 0m3.075s
$ time python test_pandas.py largefile.txt
real 0m10.937s
user 0m9.819s
sys 0m0.504s
Edit: for the new example file
In [1]: import pandas as pd
In [2]: df = pd.read_table('largefile.txt', header=None,
   ...:                    sep=' ', index_col=0).sort_index()
In [3]: df.index
Out[3]: Int64Index([0, 1, 1, ..., 9999998, 9999999, 9999999], dtype=int64)
In [4]: df[1][0]
Out[4]: 6301
In [5]: df[1][1].values
Out[5]: array([8936, 5983])
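If you ultimately want plain Python lists per key, like the original defaultdict gives you, one possible sketch (my own addition, not part of the timings above) is to group on the key column and collect the values into a dict of lists:

import pandas as pd

# Sketch: build a {key: [values]} mapping from the two-column file.
# File name and separator follow the examples above; adjust for your data.
df = pd.read_table('largefile.txt', header=None, sep=' ')
lists_by_key = df.groupby(0)[1].apply(list).to_dict()

# Querying a key then works like the defaultdict version:
print(lists_by_key.get(1, []))

Note that building the dict of lists costs extra time and memory on top of the read_table call, so it only pays off if you really need list objects rather than DataFrame lookups.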
Here are a few quick performance improvements I managed to get:
Using a plain dict instead of defaultdict, and changing d[parts[0]].append(parts[1]) to d[parts[0]] = d.get(parts[0], []) + [parts[1]], cut the time by 10%. I don't know whether it's eliminating all those calls to a Python __missing__ function, not mutating the lists in-place, or something else that deserves the credit.
Just using setdefault on a plain dict instead of defaultdict also cuts the time by 8%, which implies that it's the extra dict work rather than the in-place appends.
Meanwhile, replacing the split() with split(None, 1) helps by 9%.
Running in PyPy 1.9.0 instead of CPython 2.7.2 cut the time by 52%; PyPy 2.0b by 55%.
If you can't use PyPy, CPython 3.3.0 cut the time by 9%.
Running in 32-bit mode instead of 64-bit increased the time by 170%, which implies that if you're using 32-bit you may want to switch.
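For reference, here is a rough sketch of the kind of harness one could use to measure these variants (an assumption on my part, not the exact script behind the numbers above); it times only the parsing loop, so interpreter startup and imports don't pollute the comparison:

import sys
import time
from collections import defaultdict

# Rough timing sketch: swap the loop body for whichever variant you want
# to compare (defaultdict.append, dict.get + concat, dict.setdefault, ...).
d = defaultdict(list)
start = time.time()
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split(None, 1)
        d[parts[0]].append(parts[1])
print('%.2f seconds, %d keys' % (time.time() - start, len(d)))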
The fact that the dict takes over 2GB to store (slightly less in 32-bit) is probably a big part of the problem. The only real alternative is to store everything on disk. (In a real-life app you'd probably want to manage an in-memory cache, but here, you're just generating the data and quitting, which makes things simpler.) Whether this helps depends on a number of factors. I suspect that on a system with an SSD and not much RAM it'll speed things up, while on a system with a 5400rpm hard drive and 16GB of RAM (like the laptop I'm using at the moment) it won't… But depending on your system's disk cache, etc., who knows, without testing.
There's no quick&dirty way to store lists of strings in disk-based storage (shelve will presumably waste more time with the pickling and unpickling than it saves), but changing it to just concatenate strings instead and using gdbm kept the memory usage down below 200MB and finished in about the same time, and has the nice side effect that if you want to use the data more than once, you've got them stored persistently. Unfortunately, plain old dbm wouldn't work because the default page size is too small for this many entries, and the Python interface doesn't provide any way to override the default.
Switching to a simple sqlite3 database that just has non-unique Key and Value columns and doing it in :memory: took about 80% longer, while on disk it took 85% longer. I suspect that denormalizing things to store multiple values with each key wouldn't help, and would in fact make things worse. (Still, for many real life uses, this may be a better solution.)
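For concreteness, here is a minimal sketch of that sqlite3 variant (the table layout and column names are my guesses; the description above only says non-unique Key and Value columns):

import sqlite3
import sys

# Sketch of the sqlite3 approach: one row per (key, value) pair,
# with a non-unique index on key so lookups stay fast.
conn = sqlite3.connect(':memory:')  # or a file path for the on-disk variant
conn.execute('CREATE TABLE kv (key TEXT, value TEXT)')

with open(sys.argv[1]) as fin:
    conn.executemany('INSERT INTO kv VALUES (?, ?)',
                     (line.split() for line in fin))
conn.execute('CREATE INDEX kv_key ON kv (key)')
conn.commit()

# Query: all values associated with one key.
values = [row[0] for row in
          conn.execute('SELECT value FROM kv WHERE key = ?', ('42',))]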
Meanwhile, wrapping cProfile around your main loop (one way to do this wrapping is sketched a little further down):
40000002 function calls in 31.222 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 21.770 21.770 31.222 31.222 <string>:2(<module>)
20000000 2.373 0.000 2.373 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
20000000 7.079 0.000 7.079 0.000 {method 'split' of 'str' objects}
So, that's one third of your time spent in string.split, 10% spent in append, and the rest spent in code that cProfile couldn't see, which here includes both iterating the file and the defaultdict method calls.
Switching to a regular dict with setdefault (which, remember, was a little faster) shows 3.774 seconds spent in setdefault, so that's about 15% of the time, or presumably around 20% for the defaultdict version. Presumably the __setitem__ method isn't going to be any worse than the setdefault or defaultdict.__getitem__ were.
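Here, for reference, is roughly how that cProfile wrapping of the main loop can look (my reconstruction, not necessarily the exact invocation used; running the loop from a string is at least consistent with the <string>:2(<module>) row in the output above):

import cProfile
import sys
from collections import defaultdict

fin = open(sys.argv[1])
d = defaultdict(list)

# Profile only the parsing loop, not interpreter startup and imports.
cProfile.run("""
for line in fin:
    parts = line.split()
    d[parts[0]].append(parts[1])
""")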
However, we may not be seeing the time charged by malloc calls here, and they may be a huge chunk of the performance. To test that, you'll need a C-level profiler. So let's come back to that.
Meanwhile, at least some of the leftover time is probably taken up by the line-splitting as well, since that must cost on the same order as space-splitting, right? But I don't know of any way to improve that significantly.
Finally, a C-level profiler is going to help here, but one run on my system may not help much for your system, so I'll leave that to you.
The fastest version on my system depends on which Python I run, but it's either this:
d = {}
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], []) + [parts[1]]
Or this:
d = {}
for line in fin:
    parts = line.split(None, 1)
    d.setdefault(parts[0], []).append(parts[1])
… And they're both pretty close to each other.
The gdbm solution, which was about the same speed, and has obvious advantages and disadvantages, looks like this:
import gdbm

d = gdbm.open(sys.argv[1] + '.db', 'c')
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], '') + ',' + parts[1]
(Obviously if you want to be able to run this repeatedly, you will need to add a line to delete any pre-existing database—or, better, if it fits your use case, to check its timestamp against the input file's and skip the whole loop if it's already up-to-date.)
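A rough sketch of that freshness check (my illustration; the path convention follows the snippet above):

import os
import sys

db_path = sys.argv[1] + '.db'

# Rebuild only if the database is missing or older than the input file.
if (os.path.exists(db_path)
        and os.path.getmtime(db_path) >= os.path.getmtime(sys.argv[1])):
    print('database is up to date, skipping rebuild')
else:
    if os.path.exists(db_path):
        os.remove(db_path)  # start from a clean database
    # ... run the gdbm loop shown above ...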