BACKGROUND
I have a lot of numeric message codes in a NumPy array, and I'd need to convert them into strings fast. I have had some problems with the performance and would like to understand why and how to make it quick.
SOME BENCHMARKS
I - The trivial approach
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
The dictionary lookup takes the better part of my coffee break, 758 ms. (I also tried res = map(lookupdict.get, arr)
but that's even worse.)
II - Without NumPy
import random
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = [ random.choice(lookupdict.keys()) for _ in range(1000000) ]
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
The timing results change quite considerably to 76 ms!
It should be noted that I am interested in timing the lookup. The random generation is just to create some test data. It is not interesting whether it takes a lot of time or not. All benchmark results given here are only for the one million lookups.
III - Convert NumPy array to a list
My first guess was that this has something to do with list vs. array problems. However, by modifying the NumPy version to use lists:
res = [ lookupdict[k] for k in list(arr) ]
gives me 778 ms, of which around 110 ms is spent converting the list and 570 ms doing the lookup. So, the lookup is a bit faster, but the total time is the same.
IV - Type conversion from np.int32
to int
As the only other difference seems to be the data type (np.int32
vs. int
), I tried converting the types on-the-fly. This is a bit stupid, as probably dict does the same:
res = [ lookupdict[int(k)] for k in arr ]
However, this seems to do something interesting, because the time drops to 266 ms. It seems that almost-but-not-quite-the-same datatypes play nasty tricks with dictionary lookups and that dict code is not very efficient with conversions.
V - Dictionary key conversion to np.int32
To test this, I modified the NumPy version to use exactly the same data type in dict keys and lookup:
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
np.int32(1): "val1",
np.int32(2): "val2",
np.int32(27): "val3",
np.int32(35): "val4",
np.int32(59): "val5" }
# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
This improved to 177 ms. Not an insignificant improvement but a far cry form the 76 ms.
VI - Array conversion to use int
objects
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = np.array([ random.choice(lookupdict.keys()) for _ in range(1000000) ],
dtype='object')
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
This gives 86 ms, which is already very close to the native Python 76 ms.
Result summary
int
, indexing with int
(native Python): 76 msint
, indexing with int
objects (NumPy): 86 msnp.int32
, indexing with np.int32
: 177 msint
, indexing with np.int32
: 758 msQUESTION(S)
Why? And what can I do to make the dictionary lookups as fast as possible? My input data is a NumPy array, so the best (fastest but ugly) this far is to convert the dict keys into np.int32
. (Unfortunately, the dict keys may be spread over a wide range of numbers, so indexing array-by-array is not a viable alternative. Fast it would be though, 10 ms.)
This is interesting, I may have found an answer to my question.
The alternative III was to convert the array into a list. It seems that this provides very good results if done the right way. This:
res = [ lookupdict[k] for k in list(arr) ]
clocks 778 ms.
But this:
res = [ lookupdict[k] for k in arr.tolist() ]
clocks 86 ms.
The technical explanation behind this being that arr.tolist
converts the array into int
objects, whereas list(arr)
creates a list of np.int32
objects.
In my timings, your II - Without NumPy
is quite a bit slower than I
In [11]: timeit [lookupdict[k] for k in np.random.choice(lookupdict.keys(),1000000)]
1 loops, best of 3: 658 ms per loop
In [12]: timeit [lookupdict[k] for k in [np.random.choice(lookupdict.keys()) for _ in range(1000000)]]
1 loops, best of 3: 8.04 s per loop
But if skip the lookup by making the choice
on the values, you gain more time
In [34]: timeit np.random.choice(lookupdict.values(),1000000)
10 loops, best of 3: 85.3 ms per loop
OK, lets focus on the lookup:
In [26]: arr =np.random.choice(lookupdict.keys(),1000000)
In [27]: arrlist=arr.tolist()
In [28]: timeit res = [lookupdict[k] for k in arr]
1 loops, best of 3: 583 ms per loop
In [29]: timeit res = [lookupdict[k] for k in arrlist]
10 loops, best of 3: 120 ms per loop
In [30]: timeit res = [lookupdict[k] for k in list(arr)]
1 loops, best of 3: 675 ms per loop
In [31]: timeit res = [lookupdict[k] for k in arr.tolist()]
10 loops, best of 3: 156 ms per loop
In [32]: timeit res = [k for k in arr]
1 loops, best of 3: 215 ms per loop
In [33]: timeit res = [k for k in arrlist]
10 loops, best of 3: 51.4 ms per loop
In [42]: timeit arr.tolist()
10 loops, best of 3: 33.6 ms per loop
In [43]: timeit list(arr)
1 loops, best of 3: 264 ms per loop
First observation - iteration over an np.array
is slower than iteration over the equivalent list
Second - list(arr)
is slower the arr.tolist()
. list()
appears to have 2 problems. By itself it is slower, and the items are np.int32
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With