
Search for keywords efficiently when the keywords are multi-word

I need to match a really large list of keywords (>1000000) against a string efficiently using Python. I found some really good libraries which try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) Aho-Corasick Algorithm etc.

However, I have a peculiar requirement: in my context, a keyword such as 'XXXX YYYY' should return a match if my string is ' XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur as a substring here, but XXXX and YYYY are both present in the string, and that is good enough for me to count as a match.
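To pin the rule down, here is a minimal sketch of the check I mean. The helper names `words` and `naive_matches` are made up for illustration, and splitting on lowercase word characters is an assumption about tokenization, not a requirement.

```python
import re


def words(text):
    """Return the set of lowercase word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def naive_matches(text, keywords):
    """Return the keywords whose words all occur somewhere in text.

    Order and adjacency are ignored. Every keyword is checked in turn,
    so the cost grows linearly with the size of the keyword list.
    """
    text_words = words(text)
    return [kw for kw in keywords if words(kw) <= text_words]


keywords = ["XXXX YYYY", "XXXX ZZZZ"]
text = " XXXX is a very good indication of YYYY"
matched = naive_matches(text, keywords)
print(matched)
```

With more than a million keywords, that per-keyword loop is the part I want to avoid.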

I know how to do it naively. What I am looking for is efficiency. Are there any other good libraries for this?

suzee asked Oct 29 '22 20:10



1 Answer

What you ask sounds like a full-text search task. There's a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory like the following.

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

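# A single TEXT field; stored=True keeps the original text so hits can show it.
schema = fields.Schema(text=fields.TEXT(stored=True))
# RamStorage keeps the whole index in memory.
storage = RamStorage()
index = storage.create_index(schema)

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# The default query parser requires every term, so this means dog AND apple.
query = QueryParser('text', schema).parse('dog apple')
with index.searcher() as searcher:
    results = searcher.search(query)
    for r in results:
        print(r)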

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>
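If pulling in a search library is not an option, the same all-words-present rule can be sketched in plain Python for the question's direction (many keywords, one string). This is only a sketch under assumptions: the names `build_keyword_index` and `match_keywords` are hypothetical, tokenization is a simple lowercase word split, and it has not been measured against a million keywords.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return the set of lowercase word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def build_keyword_index(keywords):
    """Map each word to the ids of the keywords that contain it.

    Also record how many distinct words each keyword needs.
    """
    word_to_ids = defaultdict(list)
    required = []
    for kid, keyword in enumerate(keywords):
        kw_words = tokenize(keyword)
        required.append(len(kw_words))
        for word in kw_words:
            word_to_ids[word].append(kid)
    return word_to_ids, required


def match_keywords(text, keywords, word_to_ids, required):
    """Return keywords whose words all occur in text, in any order."""
    seen = defaultdict(int)
    # Only keywords sharing at least one word with the text are touched.
    for word in tokenize(text):
        for kid in word_to_ids.get(word, ()):
            seen[kid] += 1
    hits = sorted(kid for kid, n in seen.items() if n == required[kid])
    return [keywords[kid] for kid in hits]


keywords = ["XXXX YYYY", "XXXX ZZZZ", "YYYY"]
word_to_ids, required = build_keyword_index(keywords)
found = match_keywords(" XXXX is a very good indication of YYYY",
                       keywords, word_to_ids, required)
print(found)
```

The index is built once; each lookup then costs roughly the number of (word, keyword) pairs the string actually shares with the keyword list, rather than one check per keyword.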

You can also persist the Whoosh index on disk using FileStorage, as described in "How to index documents" in the Whoosh documentation.

saaj answered Nov 15 '22 06:11
