I have a very large (200k+) set of key/value pairs, from which I need to retrieve very many (sometimes all) of the values. The obvious way to do this is with a dictionary, such that
values = {lookup.get(key) for key in key_set}
This is getting very time consuming in my code, and I'm wondering if there's a faster way to implement this with a NumPy array. I've been experimenting with using an array with two columns and n rows, such that for any individual key:
value = lookup_array[lookup_array[:,0] == key, 1]
But I'm not sure how to scale this up to many keys without costly iteration. I've looked at:
values = lookup_array[np.in1d(lookup_array[:,0], key_set), 1]
but this also seems time consuming.
Is there any other way to do a massive lookup of nonconsecutive values quickly without iterating?
If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.
- The keys must be integers.
- You have enough memory to create a NumPy array whose size equals the maximum key value you wish to look up (so that every key corresponds to a valid index into the array).
The idea is to use
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
result = lookup_array[key_set]
instead of
result = {lookup_dict.get(key) for key in key_set}
For example,
import numpy as np
import pandas as pd
def using_dict(lookup_dict, key_set):
return {lookup_dict.get(key) for key in key_set}
def using_array(lookup_array, key_set):
return lookup_array[key_set]
def using_pandas(df, key_set):
return df.loc[df['a'].isin(key_set)]
M = 10**6
N = 2*10**5
K = 10**4
keys = np.random.randint(M, size=(N,))
values = np.random.random((N,))
lookup_dict = dict(zip(keys, values))
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
df = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))
key_set = np.random.choice(keys, size=(K,))
And here is a timeit benchmark (using IPython) for the methods above:
In [25]: %timeit using_array(lookup_array, key_set)
10000 loops, best of 3: 22.4 µs per loop
In [26]: %timeit using_dict(lookup_dict, key_set)
100 loops, best of 3: 3.73 ms per loop
In [24]: %timeit using_pandas(df, key_set)
10 loops, best of 3: 38.9 ms per loop
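One caveat with the array approach above: `np.empty` leaves the unassigned slots filled with uninitialized memory, so looking up a key that was never stored silently returns garbage. A minimal sketch of one way to guard against that, assuming the values are floats so `NaN` can serve as a sentinel (the small `keys`/`values` arrays here are illustrative, not the benchmark data):

```python
import numpy as np

M = 10**6
keys = np.array([2, 5, 7])
values = np.array([0.2, 0.5, 0.7])

# Fill with NaN so a lookup of a key that was never stored is detectable,
# instead of returning whatever uninitialized memory np.empty left behind.
lookup_array = np.full((M,), np.nan, dtype=values.dtype)
lookup_array[keys] = values

key_set = np.array([5, 3, 7])
result = lookup_array[key_set]   # key 3 was never stored, so result[1] is NaN
missing = np.isnan(result)       # boolean mask marking keys with no stored value
```

This only works when `NaN` cannot be a legitimate stored value; for integer values you would need a different sentinel (e.g. `-1` if keys' values are known to be non-negative).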
Here's an approach with np.searchsorted:

row_idx = np.searchsorted(lookup_array[:,0], key_set)
values = lookup_array[row_idx,1]

This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.