
Looking up large sets of keys: dictionary vs. NumPy array

I have a very large (200k+) set of key/value pairs, from which I need to retrieve a very large number (sometimes all) of the values. The obvious way to do this is with a dictionary, like so:

 values = {lookup.get(key) for key in key_set}

This is getting very time consuming in my code, and I'm wondering if there's a faster way to implement this with a NumPy array. I've been experimenting with using an array with two columns and n rows, such that for any individual key:

value = lookup_array[lookup_array[:,0] == key, 1]

But I'm not sure how to scale this up to many keys without costly iteration. I've looked at:

values = lookup_array[np.in1d(lookup_array[:,0], key_set), 1]

but this also seems time consuming.

Is there any other way to do a massive lookup of nonconsecutive values quickly without iterating?

asked Apr 15 '16 by triphook

2 Answers

If certain special conditions apply, you can use NumPy indexing as a very fast alternative to dictionary lookups.

  • The keys must be integers

  • You must have enough memory to create a NumPy array whose size is as big as the maximum key value you wish to look up (so that every key corresponds to a valid index into the array).

The idea is to use

lookup_array = np.empty((M,), dtype=values.dtype)  # M exceeds the max key value
lookup_array[keys] = values                        # scatter values in by key
result = lookup_array[key_set]                     # gather with fancy indexing

instead of

result = {lookup_dict.get(key) for key in key_set}

For example,

import numpy as np
import pandas as pd

def using_dict(lookup_dict, key_set):
    return {lookup_dict.get(key) for key in key_set}

def using_array(lookup_array, key_set):
    return lookup_array[key_set]

def using_pandas(df, key_set):
    return df.loc[df['a'].isin(key_set)]

M = 10**6    # size of the lookup array (must exceed the max key value)
N = 2*10**5  # number of key/value pairs
K = 10**4    # number of keys to look up
keys = np.random.randint(M, size=(N,))
values = np.random.random((N,))
lookup_dict = dict(zip(keys, values))
lookup_array = np.empty((M,), dtype=values.dtype)
lookup_array[keys] = values
df = pd.DataFrame(np.column_stack([keys, values]), columns=list('ab'))
key_set = np.random.choice(keys, size=(K,))

And here is a timeit benchmark (using IPython) for the methods above:

In [25]: %timeit using_array(lookup_array, key_set)
10000 loops, best of 3: 22.4 µs per loop

In [26]: %timeit using_dict(lookup_dict, key_set)
100 loops, best of 3: 3.73 ms per loop

In [24]: %timeit using_pandas(df, key_set)
10 loops, best of 3: 38.9 ms per loop
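
A caveat with the array approach: because lookup_array was allocated with np.empty, the slots for key values that were never assigned hold arbitrary garbage, so looking up a missing key silently returns junk instead of raising a KeyError the way a dict would. Here is a minimal sketch of one way to guard against that, assuming the values are floats so NaN can serve as a missing-key sentinel (tiny made-up data):

import numpy as np

# Hypothetical small example: 5 possible key values, 3 known pairs.
M = 5
keys = np.array([0, 2, 4])
values = np.array([10.0, 20.0, 30.0])

# Fill with NaN so unassigned slots are detectable, then scatter.
lookup_array = np.full((M,), np.nan)
lookup_array[keys] = values

key_set = np.array([2, 3])        # key 3 was never assigned
result = lookup_array[key_set]    # array([20., nan])
missing = np.isnan(result)        # array([False,  True])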
answered Oct 20 '22 by unutbu


Here's an approach with np.searchsorted -

row_idx = np.searchsorted(lookup_array[:, 0], key_set)
values = lookup_array[row_idx, 1]

This assumes that lookup_array has the keys sorted in its first column. If that's not the case, you can use the optional sorter argument with np.searchsorted.
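
For the unsorted case, here is a minimal sketch of how the sorter argument can be used (assuming the same two-column layout, with made-up data):

import numpy as np

# Hypothetical two-column lookup table with an unsorted key column.
lookup_array = np.array([[30, 3.0],
                         [10, 1.0],
                         [20, 2.0]])
key_set = np.array([20, 30])

# sorter takes the permutation that sorts the key column.
sidx = lookup_array[:, 0].argsort()
pos = np.searchsorted(lookup_array[:, 0], key_set, sorter=sidx)
values = lookup_array[sidx[pos], 1]   # array([2., 3.])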

answered Oct 20 '22 by Divakar