I have a very large (200k+) set of key/value pairs, from which I need to retrieve a very large number (sometimes all) of the values. The obvious way to do this is with a dictionary, such that
 values = {lookup.get(key) for key in key_set}
This is getting very time-consuming in my code, and I'm wondering whether there's a faster way to implement it with a NumPy array. I've been experimenting with an array of two columns and n rows, such that for any individual key:
value = lookup_array[lookup_array[:,0] == key, 1]
But I'm not sure how to scale this up to many keys without costly iteration. I've looked at:
values = lookup_array[np.in1d(lookup_array[:,0], key_set), 1]
but this also seems time-consuming.
Is there any other way to do a massive lookup of nonconsecutive values quickly without iterating?
If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.
The keys must be integers.
You have enough memory to create a NumPy array whose size is as big as the maximum key value you wish to look up (so that every key corresponds to a valid index into the array).
The idea is to use
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
result = lookup_array[key_set]
instead of
result = {lookup_dict.get(key) for key in key_set}
For example,
import numpy as np
import pandas as pd
def using_dict(lookup_dict, key_set):
    return {lookup_dict.get(key) for key in key_set}
def using_array(lookup_array, key_set):
    return lookup_array[key_set]
def using_pandas(df, key_set):
    return df.loc[df['a'].isin(key_set)]
M = 10**6
N = 2*10**5
K = 10**4
keys = np.random.randint(M, size=(N,))
values = np.random.random((N,))
lookup_dict = dict(zip(keys, values))
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
df = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))
key_set = np.random.choice(keys, size=(K,))
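As a quick sanity check (my addition, not part of the original post), the dictionary and array lookups can be compared directly. Both keep the last value written for any duplicated key, so they should return the same set of values:
# Both methods resolve duplicate keys to the most recently written value,
# so comparing the results as sets should succeed.
assert set(using_array(lookup_array, key_set)) == using_dict(lookup_dict, key_set)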
And here is a timeit benchmark (using IPython) for the methods above:
In [25]: %timeit using_array(lookup_array, key_set)
10000 loops, best of 3: 22.4 µs per loop
In [26]: %timeit using_dict(lookup_dict, key_set)
100 loops, best of 3: 3.73 ms per loop
In [24]: %timeit using_pandas(df, key_set)
10 loops, best of 3: 38.9 ms per loop
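One caveat worth noting (my addition, not in the original answer): np.empty leaves unassigned slots filled with arbitrary garbage, so looking up a key that was never stored silently returns a meaningless number. If that can happen, a sentinel fill value makes the failure detectable. A minimal sketch, using NaN as the sentinel and a hypothetical tiny table:
import numpy as np

# Hypothetical small example: keys 2, 5 and 7 map to values; M = 10 covers the key range.
keys = np.array([2, 5, 7])
values = np.array([0.2, 0.5, 0.7])
M = 10

# np.full with NaN instead of np.empty, so unstored keys are detectable.
lookup_array = np.full((M,), np.nan)
lookup_array[keys] = values

result = lookup_array[np.array([5, 3])]
print(result)            # [0.5 nan] -- NaN flags the missing key 3
print(np.isnan(result))  # [False  True]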
Here's an approach with np.searchsorted:
# Assumes the first column of lookup_array is sorted and every key in
# key_set is actually present; row_idx then lines up with key_set's order.
row_idx = np.searchsorted(lookup_array[:, 0], key_set)
values = lookup_array[row_idx, 1]
This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.
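A sketch of that unsorted case (my construction; the answer above only mentions the option). searchsorted with sorter returns positions within the sorted order, which the argsort permutation then maps back to actual row indices:
import numpy as np

# Hypothetical unsorted two-column table: column 0 = keys, column 1 = values.
lookup_array = np.array([[7, 0.7],
                         [2, 0.2],
                         [5, 0.5]])
key_set = np.array([5, 7])

# argsort of the key column acts as the sorter; order[pos] converts
# positions in the sorted view back into row indices of lookup_array.
order = lookup_array[:, 0].argsort()
pos = np.searchsorted(lookup_array[:, 0], key_set, sorter=order)
row_idx = order[pos]

values = lookup_array[row_idx, 1]
print(values)  # [0.5 0.7]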