I have a list of names (strings) split into words. There are 8 million names, and each name consists of up to 20 words (tokens); the number of unique tokens is 2.2 million. I need an efficient way to find all names containing at least one word from a query (which may also contain up to 20 words, but usually only a few).
My current approach uses Python pandas and looks like this (later referred to as original):
>>> import pandas as pd
>>> df = pd.DataFrame([['foo', 'bar', 'joe'],
                       ['foo'],
                       ['bar', 'joe'],
                       ['zoo']],
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None
def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None
Currently such a lookup (on the full dataset) takes 5.75s on my (rather powerful) machine. I'd like to speed it up by at least, say, 10x.
I was able to get down to 5.29s by squeezing all columns into one and performing the lookup on that (later referred to as original, squeezed):
>>> df = pd.Series([{'foo', 'bar', 'joe'},
                    {'foo'},
                    {'bar', 'joe'},
                    {'zoo'}],
                   index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df
id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object
def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])
id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object
But that's still not fast enough.
Another solution that seems easy to implement is Python multiprocessing (threading shouldn't help here because of the GIL, and there is no I/O, right?). But the problem with it is that the big dataframe needs to be copied to each process, which takes up all the memory. Another problem is that I need to call filter_by_tokens many times in a loop, so it would copy the dataframe on every call, which is inefficient.
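For concreteness, the kind of multiprocessing variant I have in mind is roughly the following. This is an untested sketch that relies on Linux fork copy-on-write so the squeezed Series df defined above is not pickled and re-sent on every call; _filter_chunk and filter_by_tokens_mp are made-up names.

import multiprocessing as mp
import numpy as np
import pandas as pd

# df is the squeezed Series of token sets from above; forked workers
# see it via copy-on-write instead of receiving a pickled copy per call
def _filter_chunk(args):
    lo, hi, tokens = args
    chunk = df.iloc[lo:hi]
    return chunk[chunk.map(lambda x: bool(x & tokens))]

pool = mp.Pool(12)  # created once, after df exists; reused across queries

def filter_by_tokens_mp(tokens, n_chunks=12):
    bounds = np.linspace(0, len(df), n_chunks + 1).astype(int)
    jobs = [(bounds[i], bounds[i + 1], set(tokens)) for i in range(n_chunks)]
    return pd.concat(pool.map(_filter_chunk, jobs))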
Note that words may occur many times in names (e.g. the most popular word occurs 600k times in names), so a reverse index would be huge.
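Just to be explicit about what I mean by a reverse index, it would be something along these lines (a hypothetical sketch over the squeezed Series df; my worry is the memory footprint of 2.2M keys with id sets of up to 600k entries):

from collections import defaultdict

# reverse index: token -> set of name ids containing that token
index = defaultdict(set)
for name_id, words in zip(df.index, df.values):   # df is the squeezed Series above
    for w in words:
        index[w].add(name_id)

def filter_by_tokens_inverted(df, tokens):
    ids = set().union(*(index.get(t, set()) for t in tokens))
    return df.loc[sorted(ids)]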
What is a good way to do this efficiently? A Python solution is preferred, but I'm also open to other languages and technologies (e.g. databases).
UPD: I've measured the execution time of my two solutions and of the 5 solutions suggested by @piRSquared in his answer. Here are the results (tl;dr: the best is a 2x improvement):
+--------------------+----------------+
| method | best of 3, sec |
+--------------------+----------------+
| original | 5.75 |
| original, squeezed | 5.29 |
| zip | 2.54 |
| merge | 8.87 |
| mul+any | MemoryError |
| isin | IndexingError |
| query | 3.7 |
+--------------------+----------------+
mul+any gives a MemoryError on d1 = pd.get_dummies(df.stack()).groupby(level=0).sum() (on a 128 GB RAM machine).

isin gives IndexingError: Unalignable boolean Series key provided on s[d1.isin({'zoo', 'foo'}).unstack().any(1)], apparently because the shape of df.stack().isin(set(tokens)).unstack() is slightly smaller than the shape of the original dataframe (8.39M vs 8.41M rows); I'm not sure why, or how to fix that.
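One thing I might try is explicitly reindexing the boolean mask back onto the original index before using it, along these lines (an untested guess, not a verified fix):

# align the unstacked boolean mask to s's index, treating missing rows as False
mask = df.stack().isin({'zoo', 'foo'}).unstack().any(1)
mask = mask.reindex(s.index, fill_value=False)
result = s[mask]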
Note that the machine I'm using has 12 cores (though I mentioned some problems with parallelization above). All of the solutions utilize a single core.
Conclusion (as of now): there is a 2.1x improvement from zip (2.54s) over the original, squeezed solution (5.29s). That's good, though I was aiming for at least a 10x improvement, if possible. So I'm leaving the (still great) answer by @piRSquared unaccepted for now, to welcome more suggestions.
idea 0: zip

def pir(s, token):
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})
idea 1: merge
token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]
idea 2: mul + any
d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]
idea 3: isin
d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]
idea 4: query
token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in @token').index.get_level_values(0).unique()]
I have done similar things with the following tools:

HBase - the key can have multiple columns (very fast)
Elasticsearch - nice and easy to scale; you just need to import your data as JSON
Apache Lucene - will work very well for 8 million records
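For example, with Elasticsearch the indexing and query side could look roughly like this (a sketch using the official Python client; the index name, field name and exact call signatures are assumptions and vary by client/server version):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# index each name as a document whose "tokens" field holds its words
es.index(index='names', id='id1', body={'tokens': ['foo', 'bar', 'joe']})

# a terms query matches documents containing at least one of the given tokens
resp = es.search(index='names',
                 body={'query': {'terms': {'tokens': ['foo', 'zoo']}}})
print([hit['_id'] for hit in resp['hits']['hits']])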