
Find occurrences of huge list of phrases in text

Tags:

python

I'm building a backend and trying to crunch the following problem.

  • Clients submit text to the backend (around 2,000 characters on average)
  • The backend endpoint that receives the request has to apply phrase highlighting to the submitted text
  • There are around 80k phrases to match. A phrase is a simple object:

    {
        'phrase': 'phrase to match',
        'link': 'link_url'
    }
    
  • After finding all matches of phrases that exist in the text, the backend returns to the client what was matched - basically a map:

    range in text -> phrase
    

Most of it is done; I'm about to tackle coding the phrase-matching part, and everything else works smoothly. Since I don't want to reinvent the wheel, I tried googling for a Python library that efficiently finds phrases (from a huge list) in text. However, I couldn't find anything.

I checked out BeautifulSoup and the Natural Language Toolkit. However, they don't seem to do what I'm looking for.

Do you guys know if there is a library that would be helpful in such task? Seems like a common thing to implement and I don't want to go custom if there is a well established library for that.

Rafa de King asked Mar 08 '18


2 Answers

Maybe you should try flashtext.
According to the author, it is much faster than regex.

The author even published a paper about this library.

I've personally tried this library for one of my projects; in my opinion its API is quite friendly and usable.

Hope it helps.

Wong Jia Hau answered Sep 23 '22


To get reasonable speed while matching 80k patterns, you definitely need some preprocessing on the patterns; single-shot algorithms like Boyer-Moore won't help much.

You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. Regarding how to preprocess the patterns: one option is a state machine like Aho-Corasick or some generic finite-state transducer. The next option is something like a suffix-array-based index, and the last one that comes to my mind is an inverted index.
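To illustrate the Aho-Corasick idea, here is a minimal pure-Python sketch (in practice you would use a compiled implementation such as the pyahocorasick package, but the structure - a trie plus failure links, then a single pass over the text - is the same):

```python
from collections import deque

def build_automaton(patterns):
    """Aho-Corasick preprocessing: build a trie, then BFS failure links."""
    trie = [{'next': {}, 'fail': 0, 'out': []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = trie[node]['next'].get(ch)
            if nxt is None:
                nxt = len(trie)
                trie.append({'next': {}, 'fail': 0, 'out': []})
                trie[node]['next'][ch] = nxt
            node = nxt
        trie[node]['out'].append(pat)
    queue = deque(trie[0]['next'].values())  # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, nxt in trie[node]['next'].items():
            queue.append(nxt)
            f = trie[node]['fail']
            while f and ch not in trie[f]['next']:
                f = trie[f]['fail']
            trie[nxt]['fail'] = trie[f]['next'].get(ch, 0)
            # Inherit matches that end at the failure target.
            trie[nxt]['out'] += trie[trie[nxt]['fail']]['out']
    return trie

def search(text, trie):
    """One pass over the text; returns (start, end, pattern) per occurrence."""
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]['next']:
            node = trie[node]['fail']
        node = trie[node]['next'].get(ch, 0)
        for pat in trie[node]['out']:
            matches.append((i + 1 - len(pat), i + 1, pat))
    return matches

matches = search('ushers', build_automaton(['he', 'she', 'his', 'hers']))
```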

If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word- or word-ngram-keyed inverted index will be fast enough even in pure Python. The index is not a complete solution on its own: it gives you a few candidate phrases, which you then need to verify with normal string matching to confirm a complete match.
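A sketch of that candidate-then-verify approach in pure Python (assuming exact, word-boundary matches; all names are illustrative):

```python
import re
from collections import defaultdict

def build_index(phrases):
    """Inverted index: word -> set of phrases containing that word."""
    index = defaultdict(set)
    for phrase in phrases:
        for word in re.findall(r'\w+', phrase.lower()):
            index[word].add(phrase)
    return index

def find_matches(text, index):
    """Gather candidate phrases via the index, then verify each candidate
    with a word-boundary regex to recover exact character spans."""
    candidates = set()
    for word in re.findall(r'\w+', text.lower()):
        candidates |= index.get(word, set())
    matches = {}
    for phrase in candidates:
        for m in re.finditer(r'\b' + re.escape(phrase) + r'\b',
                             text, re.IGNORECASE):
            matches[(m.start(), m.end())] = phrase
    return matches

idx = build_index(['phrase to match', 'inverted index', 'python'])
res = find_matches('A Python inverted index finds a phrase to match quickly.', idx)
```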

If you need approximate matching, a character-ngram inverted index is your choice.
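The character-ngram variant can be sketched the same way: index each phrase by its character trigrams, then rank phrases by how many trigrams they share with the query (the threshold and padding here are arbitrary illustrative choices):

```python
from collections import defaultdict

def ngrams(s, n=3):
    s = '  ' + s.lower() + ' '  # pad so short strings still yield ngrams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_ngram_index(phrases, n=3):
    """Inverted index: character trigram -> set of phrases containing it."""
    index = defaultdict(set)
    for phrase in phrases:
        for g in ngrams(phrase, n):
            index[g].add(phrase)
    return index

def fuzzy_candidates(query, index, n=3, min_overlap=0.5):
    """Return phrases sharing at least min_overlap of the query's trigrams."""
    grams = ngrams(query, n)
    counts = defaultdict(int)
    for g in grams:
        for phrase in index.get(g, ()):
            counts[phrase] += 1
    return [p for p, c in counts.items() if c / len(grams) >= min_overlap]

idx = build_ngram_index(['phrase to match', 'machine learning', 'pattern matching'])
cands = fuzzy_candidates('phrase to matc', idx)  # truncated/misspelled query
```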

Regarding real implementations: flashtext, mentioned in the other answer here, seems to be a reasonable pure-Python solution if you're OK with its full-phrase-only limitation.

Otherwise, you can get reasonable results with generic multi-pattern-capable regexp libraries: one of the fastest should be Intel's Hyperscan; there are even some rudimentary Python bindings available.

Another option is Google's RE2 with Python bindings from Facebook. You want to use RE2::Set in this case.
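For contrast, the naive multi-pattern baseline that these libraries are designed to beat is a single alternation compiled with the standard re module; it produces the right spans, but one giant 80k-branch alternation scans far slower than a proper automaton (phrases and text below are made-up stand-ins):

```python
import re

# Stand-ins for the 80k phrase list.
phrases = ['phrase to match', 'backend', 'inverted index']
# Longest-first, so overlapping alternatives prefer the longest match.
pattern = re.compile('|'.join(
    re.escape(p) for p in sorted(phrases, key=len, reverse=True)))

text = 'The backend finds a phrase to match.'
matches = {(m.start(), m.end()): m.group() for m in pattern.finditer(text)}
```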

liborm answered Sep 25 '22