Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python dictionary lookup speed with NumPy datatypes

BACKGROUND

I have a lot of numeric message codes in a NumPy array, and I'd need to convert them into strings fast. I have had some problems with the performance and would like to understand why and how to make it quick.

SOME BENCHMARKS

I - The trivial approach

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

The dictionary lookup takes the better part of my coffee break, 758 ms. (I also tried res = map(lookupdict.get, arr) but that's even worse.)

II - Without NumPy

import random

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = [ random.choice(lookupdict.keys()) for _ in range(1000000) ]

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

The timing results change quite considerably to 76 ms!

It should be noted that I am interested in timing the lookup. The random generation is just to create some test data. It is not interesting whether it takes a lot of time or not. All benchmark results given here are only for the one million lookups.

III - Convert NumPy array to a list

My first guess was that this has something to do with list vs. array problems. However, by modifying the NumPy version to use lists:

res = [ lookupdict[k] for k in list(arr) ]

gives me 778 ms, of which around 110 ms is spent converting the list and 570 ms doing the lookup. So, the lookup is a bit faster, but the total time is the same.

IV - Type conversion from np.int32 to int

As the only other difference seems to be the data type (np.int32 vs. int), I tried converting the types on-the-fly. This is a bit stupid, as probably dict does the same:

res = [ lookupdict[int(k)] for k in arr ]

However, this seems to do something interesting, because the time drops to 266 ms. It seems that almost-but-not-quite-the-same datatypes play nasty tricks with dictionary lookups and that dict code is not very efficient with conversions.

V - Dictionary key conversion to np.int32

To test this, I modified the NumPy version to use exactly the same data type in dict keys and lookup:

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     np.int32(1): "val1",
     np.int32(2): "val2",
    np.int32(27): "val3",
    np.int32(35): "val4",
    np.int32(59): "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

This improved to 177 ms. Not an insignificant improvement but a far cry form the 76 ms.

VI - Array conversion to use int objects

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.array([ random.choice(lookupdict.keys()) for _ in range(1000000) ], 
               dtype='object')

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

This gives 86 ms, which is already very close to the native Python 76 ms.

Result summary

  1. dict keys int, indexing with int (native Python): 76 ms
  2. dict keys int, indexing with int objects (NumPy): 86 ms
  3. dict keys np.int32, indexing with np.int32: 177 ms
  4. dict keys int, indexing with np.int32: 758 ms

QUESTION(S)

Why? And what can I do to make the dictionary lookups as fast as possible? My input data is a NumPy array, so the best (fastest but ugly) this far is to convert the dict keys into np.int32. (Unfortunately, the dict keys may be spread over a wide range of numbers, so indexing array-by-array is not a viable alternative. Fast it would be though, 10 ms.)

like image 434
DrV Avatar asked Jul 11 '14 20:07

DrV


2 Answers

This is interesting, I may have found an answer to my question.

The alternative III was to convert the array into a list. It seems that this provides very good results if done the right way. This:

res = [ lookupdict[k] for k in list(arr) ]

clocks 778 ms.

But this:

res = [ lookupdict[k] for k in arr.tolist() ]

clocks 86 ms.

The technical explanation behind this being that arr.tolist converts the array into int objects, whereas list(arr) creates a list of np.int32 objects.

like image 90
DrV Avatar answered Sep 18 '22 09:09

DrV


In my timings, your II - Without NumPy is quite a bit slower than I

In [11]: timeit [lookupdict[k] for k in np.random.choice(lookupdict.keys(),1000000)]
1 loops, best of 3: 658 ms per loop

In [12]: timeit [lookupdict[k] for k in [np.random.choice(lookupdict.keys()) for _ in range(1000000)]]
1 loops, best of 3: 8.04 s per loop

But if skip the lookup by making the choice on the values, you gain more time

In [34]: timeit np.random.choice(lookupdict.values(),1000000)
10 loops, best of 3: 85.3 ms per loop

OK, lets focus on the lookup:

In [26]: arr =np.random.choice(lookupdict.keys(),1000000)

In [27]: arrlist=arr.tolist()

In [28]: timeit res = [lookupdict[k] for k in arr]
1 loops, best of 3: 583 ms per loop

In [29]: timeit res = [lookupdict[k] for k in arrlist]
10 loops, best of 3: 120 ms per loop

In [30]: timeit res = [lookupdict[k] for k in list(arr)]
1 loops, best of 3: 675 ms per loop

In [31]: timeit res = [lookupdict[k] for k in arr.tolist()]
10 loops, best of 3: 156 ms per loop

In [32]: timeit res = [k for k in arr]
1 loops, best of 3: 215 ms per loop

In [33]: timeit res = [k for k in arrlist]
10 loops, best of 3: 51.4 ms per loop

In [42]: timeit arr.tolist()
10 loops, best of 3: 33.6 ms per loop

In [43]: timeit list(arr)
1 loops, best of 3: 264 ms per loop

First observation - iteration over an np.array is slower than iteration over the equivalent list

Second - list(arr) is slower the arr.tolist(). list() appears to have 2 problems. By itself it is slower, and the items are np.int32.

like image 30
hpaulj Avatar answered Sep 19 '22 09:09

hpaulj