Efficient lookup by common words

I have a list of names (strings) divided into words. There are 8 million names, and each name consists of up to 20 words (tokens). The number of unique tokens is 2.2 million. I need an efficient way to find all names containing at least one word from the query (which may also contain up to 20 words, but usually only a few).

My current approach uses Python Pandas and looks like this (later referred to as original):

>>> df = pd.DataFrame([['foo', 'bar', 'joe'], 
                       ['foo'], 
                       ['bar', 'joe'], 
                       ['zoo']], 
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None
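
Aside, regarding the inline question above: as far as I know, the index can be named at construction time by passing a named Index, e.g.

df = pd.DataFrame([['foo', 'bar', 'joe'],
                   ['foo'],
                   ['bar', 'joe'],
                   ['zoo']],
                  index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='id'))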

def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results    
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])

       0     1     2
id                  
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None    

Currently such a lookup (on the full dataset) takes 5.75s on my (rather powerful) machine. I'd like to speed it up by at least, say, 10x.

I was able to get down to 5.29s by squeezing all columns into one and performing the lookup on that (later referred to as original, squeezed):

>>> df = pd.Series([{'foo', 'bar', 'joe'}, 
                    {'foo'}, 
                    {'bar', 'joe'}, 
                    {'zoo'}], 
                    index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df

id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object

def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])

id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object

But that's still not fast enough.

Another solution that seems easy to implement is to use Python multiprocessing (threading shouldn't help here because of the GIL, and there is no I/O, right?). The problem is that the big dataframe needs to be copied to each process, which eats up all the memory. Another problem is that I need to call filter_by_tokens many times in a loop, so the dataframe would be copied on every call, which is inefficient.
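
For reference, here is a rough, untested sketch of that multiprocessing idea (parallel_filter, _filter_chunk and SHARED are made-up names). On Linux, worker processes created by fork inherit module-level globals copy-on-write, so if the pool is created once after the data is loaded, the Series at least doesn't have to be pickled and sent on every call:

import multiprocessing as mp
import numpy as np
import pandas as pd

SHARED = None  # the big squeezed Series; must be set before the pool is created

def _filter_chunk(args):
    # each worker scans one slice of the inherited Series
    lo, hi, tokens = args
    chunk = SHARED.iloc[lo:hi]
    return chunk[chunk.map(lambda x: bool(x & tokens))]

def parallel_filter(pool, n_jobs, tokens):
    bounds = np.linspace(0, len(SHARED), n_jobs + 1).astype(int)
    jobs = [(bounds[i], bounds[i + 1], set(tokens)) for i in range(n_jobs)]
    return pd.concat(pool.map(_filter_chunk, jobs))

# usage sketch:
# SHARED = df                # the squeezed Series of sets
# pool = mp.Pool(12)         # create once, reuse for all queries
# result = parallel_filter(pool, 12, ['foo', 'zoo'])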

Note that words may occur many times in names (e.g. the most popular word occurs 600k times across the names), so a reverse index would be huge.
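
For reference, a minimal sketch of the reverse (inverted) index mentioned above; it holds one entry per token occurrence, which is why the 600k-occurrence words make it large:

from collections import defaultdict

def build_inverted_index(pairs):
    # pairs: iterable of (name_id, set_of_tokens)
    index = defaultdict(list)      # token -> ids of the names containing it
    for name_id, tokens in pairs:
        for tok in tokens:
            index[tok].append(name_id)
    return index

def lookup(index, query_tokens):
    # union of the posting lists of all query tokens
    hits = set()
    for tok in query_tokens:
        hits.update(index.get(tok, ()))
    return hits

# usage sketch, where df is the squeezed Series of sets:
# index = build_inverted_index(zip(df.index, df.values))
# ids = lookup(index, ['foo', 'zoo'])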

What is a good way to do this efficiently? A Python solution is preferred, but I'm also open to other languages and technologies (e.g. databases).


UPD: I've measured the execution time of my two solutions and of the 5 solutions suggested by @piRSquared in his answer. Here are the results (tl;dr: the best is a 2x improvement):

+--------------------+----------------+
|       method       | best of 3, sec |
+--------------------+----------------+
| original           | 5.75           |
| original, squeezed | 5.29           |
| zip                | 2.54           |
| merge              | 8.87           |
| mul+any            | MemoryError    |
| isin               | IndexingError  |
| query              | 3.7            |
+--------------------+----------------+

mul+any gives MemoryError on d1 = pd.get_dummies(df.stack()).groupby(level=0).sum() (on a 128 GB RAM machine).

isin gives IndexingError: Unalignable boolean Series key provided on s[d1.isin({'zoo', 'foo'}).unstack().any(1)], apparently because the shape of df.stack().isin(set(tokens)).unstack() is slightly smaller than the shape of the original dataframe (8.39M vs 8.41M rows); I'm not sure why, or how to fix that.
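
A hedged guess at a workaround (untested on the real data, with df and s as in the answer below): reindex the boolean mask back onto the original index, so that any rows lost in stack()/unstack() default to False:

mask = df.stack().isin({'zoo', 'foo'}).unstack().any(axis=1)
result = s[mask.reindex(s.index, fill_value=False)]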

Note that the machine I'm using has 12 cores (though I mentioned some problems with parallelization above). All of the solutions utilize a single core.

Conclusion (as of now): zip (2.54s) gives a 2.1x improvement over the original, squeezed solution (5.29s). That's good, though I was aiming for at least a 10x improvement, if possible. So I'm leaving the (still great) answer by @piRSquared unaccepted for now, to welcome more suggestions.

asked Oct 20 '16 by Dennis Golomazov

2 Answers
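
(The ideas below appear to assume the question's data, with df being the wide frame and s the squeezed Series of sets; a minimal reconstruction might be:)

import pandas as pd

df = pd.DataFrame([['foo', 'bar', 'joe'],
                   ['foo'],
                   ['bar', 'joe'],
                   ['zoo']],
                  index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='id'))
s = df.stack().groupby(level=0).apply(set)   # Series of token sets, indexed by id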

idea 0
zip

def pir(s, token):
    # build a boolean mask by intersecting each name's token set with the query set
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})

idea 1
merge

# long format: one row per (id, token); inner-merge with the query tokens, then select the matching ids
token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]

idea 2
mul + any

# token-count matrix: one row per id, one column per unique token
d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
# multiply by the query indicator (aligned on columns) and keep rows with any hit
s[d1.mul(token).any(1)]

idea 3
isin

# flag matching cells in the stacked (long) format, then reduce to one boolean per id
d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]

idea 4
query

# long-format frame filtered with query(), referencing the tokens as a local variable (@token)
token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in @token').index.get_level_values(0).unique()]
answered by piRSquared


I have done similar things with the following tools:

HBase - keys can have multiple columns (very fast)
Elasticsearch - nice and easy to scale; you just need to import your data as JSON (a sketch follows below)
Apache Lucene - will be very good for 8 million records
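
As a rough illustration of the Elasticsearch route (a sketch only; the index name "names" and the field "tokens" are made up, and client details vary by version), you could index one document per name and use a terms query, which matches documents whose field contains any of the given values:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# one document per name; df is the question's Series of token sets
actions = ({"_index": "names", "_id": name_id, "_source": {"tokens": list(tokens)}}
           for name_id, tokens in zip(df.index, df.values))
helpers.bulk(es, actions)

# "terms" matches documents containing any of the query words;
# for very large result sets you would page with search_after or scroll
resp = es.search(index="names",
                 body={"query": {"terms": {"tokens": ["foo", "zoo"]}}, "size": 10000})
ids = [hit["_id"] for hit in resp["hits"]["hits"]]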

answered by Tim Seed