I'm building a backend and trying to crunch the following problem.
2000
characters on average)There is around 80k
phrases to match. A phrase is a simple object:
{
'phrase': 'phrase to match'
'link': 'link_url'
}
After finding all matches of phrases that exist in the text, the backend returns to the client what was matched - basically a map:
range in text -> phrase
Most is done. I'm about to tackle coding the phrase matching part. Everything else works smoothly. Since I don't want to reinvent the wheel I tried googling to find a Python library that does the job of efficiently finding phrases (from huge list) in text. However, I couldn't find anything.
I checked out the BlueSoup and Natural Language Toolkit. However they don't seem to be doing what I'm looking for.
Do you guys know if there is a library that would be helpful in such task? Seems like a common thing to implement and I don't want to go custom if there is a well established library for that.
Maybe you should try flashtext.
According to the author, it is much more faster than Regex.
The author even published a paper for this library.
I've personally tried this library for one of my project, in my opinion its API is quite friendly and usable.
Hope it helps.
To get a reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns, single-shot algorithms like Boyer-Moore
won't help much.
You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. Regarding how to preprocess the patterns - one option is state machines like Aho-Corasick
or some generic finite state transducer. The next option is something like a suffix array
based index, and the last one that comes to my mind is inverted index.
If your matches are exact and the patterns respect word boundaries, chances are that a well implemented word or word-ngram keyed inverted index
will be fast enough even in pure Python. The index is not a complete solution, it will rather give you a few candidate phrases which you need to check with normal string matching for a complete match.
If you need approximate matching, character-ngram inverted index is your choice.
Regarding real implementations - flashtext mentioned in other answer here seems to be a reasonable pure Python solution if you're OK with the full-phrase-only limitation.
Otherwise you can get reasonable results with generic multi-pattern capable regexp libraries: one of the fastest should be Intel's hyperscan - there are even some rudimentary python bindings available.
Other option is Google's RE2 with Python bindings from Facebook. You want to use RE2::Set
in this case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With