I have a very large (200k+) set of key/value pairs, from which I need to retrieve a very large number (sometimes all) of the values. The obvious way to do this is with a dictionary, such that
 values = {lookup.get(key) for key in key_set}
This is getting very time-consuming in my code, and I'm wondering whether there's a faster way to implement it with a NumPy array. I've been experimenting with an array of two columns and n rows, such that for any individual key:
value = lookup_array[lookup_array[:,0] == key, 1]
But I'm not sure how to scale this up to many keys without costly iteration. I've looked at:
values = lookup_array[np.in1d(lookup_array[:,0], key_set), 1]
but this also seems time-consuming.
Is there any other way to do a massive lookup of nonconsecutive values quickly without iterating?
If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.
The keys must be integers.
You have enough memory to create a NumPy array whose size is as big as the maximum key value you wish to look up (so that every key corresponds to a valid index into the array).
The idea is to use
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
result = lookup_array[key_set]
instead of
result = {lookup_dict.get(key) for key in key_set}
For example,
import numpy as np
import pandas as pd
def using_dict(lookup_dict, key_set):
    return {lookup_dict.get(key) for key in key_set}
def using_array(lookup_array, key_set):
    return lookup_array[key_set]
def using_pandas(df, key_set):
    return df.loc[df['a'].isin(key_set)]
M = 10**6
N = 2*10**5
K = 10**4
keys = np.random.randint(M, size=(N,))
values = np.random.random((N,))
lookup_dict = dict(zip(keys, values))
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
df = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))
key_set = np.random.choice(keys, size=(K,))
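As a quick sanity check (my addition, not part of the original post), the dictionary and array lookups can be compared directly. Both keep the last value written for any duplicated key, so they should return the same set of values:
# Both methods resolve duplicate keys to the most recently written value,
# so comparing the results as sets should succeed.
assert set(using_array(lookup_array, key_set)) == using_dict(lookup_dict, key_set)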
And here is a timeit benchmark (using IPython) for the methods above:
In [25]: %timeit using_array(lookup_array, key_set)
10000 loops, best of 3: 22.4 µs per loop
In [26]: %timeit using_dict(lookup_dict, key_set)
100 loops, best of 3: 3.73 ms per loop
In [24]: %timeit using_pandas(df, key_set)
10 loops, best of 3: 38.9 ms per loop
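One caveat worth noting (my addition, not in the original answer): np.empty leaves unassigned slots filled with arbitrary garbage, so looking up a key that was never stored silently returns a meaningless number. If that can happen, a sentinel fill value makes the failure detectable. A minimal sketch, using NaN as the sentinel and a hypothetical tiny table:
import numpy as np

# Hypothetical small example: keys 2, 5 and 7 map to values; M = 10 covers the key range.
keys = np.array([2, 5, 7])
values = np.array([0.2, 0.5, 0.7])
M = 10

# np.full with NaN instead of np.empty, so unstored keys are detectable.
lookup_array = np.full((M,), np.nan)
lookup_array[keys] = values

result = lookup_array[np.array([5, 3])]
print(result)            # [0.5 nan] -- NaN flags the missing key 3
print(np.isnan(result))  # [False  True]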
Here's an approach with np.searchsorted:
# Assumes the first column of lookup_array is sorted and every key in
# key_set is actually present; row_idx then lines up with key_set's order.
row_idx = np.searchsorted(lookup_array[:, 0], key_set)
values = lookup_array[row_idx, 1]
This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.
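A sketch of that unsorted case (my construction; the answer above only mentions the option). searchsorted with sorter returns positions within the sorted order, which the argsort permutation then maps back to actual row indices:
import numpy as np

# Hypothetical unsorted two-column table: column 0 = keys, column 1 = values.
lookup_array = np.array([[7, 0.7],
                         [2, 0.2],
                         [5, 0.5]])
key_set = np.array([5, 7])

# argsort of the key column acts as the sorter; order[pos] converts
# positions in the sorted view back into row indices of lookup_array.
order = lookup_array[:, 0].argsort()
pos = np.searchsorted(lookup_array[:, 0], key_set, sorter=order)
row_idx = order[pos]

values = lookup_array[row_idx, 1]
print(values)  # [0.5 0.7]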