
How to match only whole words with Aho-Corasick?

Our Ruby on Rails app uses the aho_corasick gem to check whether a given text contains any word from a predefined list of bad words (these are loaded from a static config when the app starts).

However, this produces a few false positives. For example, if a bad word in the config is "abc", then text containing "habcd" is also flagged, which is not the intent.

So I tried changing the config word from "abc" to " abc " (a space added before and after the word). However, this has another drawback: a text like "abc is xyz" is no longer flagged, whereas it should be. I would therefore have to add two more entries, "abc " and " abc", to my config, and similarly "-abc", "abc-", ":abc", and so on. That makes the config very large, since there are many such words besides "abc".

So I was wondering whether there is some kind of regular expression that I can enter in my config, like [",-" "]abc[",-" "], so that all of the above cases are covered and no false positives are reported.

We use gem 'aho_corasick', '0.1.0', with Ruby 1.9.3 and Rails 3.2.8.

Any help is greatly appreciated. Thanks in advance!! :)

user3903418 asked Jun 02 '26 07:06



1 Answer

The simplest way to solve this problem is to use the standard implementation to get all the matches, then discard every match that is not bounded by a word delimiter (or the start or end of the text) immediately before its first character and immediately after its last character. In the average case this should not cause a significant performance hit, because there will usually be only a few matches to check.
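As a rough illustration of that post-filtering step, here is a minimal Ruby sketch. It does not rely on the aho_corasick gem's API: `candidate_matches` is a hypothetical stand-in (a plain substring scan) for whatever your matcher returns, and `whole_word_matches` and `BAD_WORDS` are names made up for this example. It assumes that a "word delimiter" means any character that is not a letter or digit, which you may want to adjust.

```ruby
# Hypothetical word list; in the real app this would come from the config.
BAD_WORDS = ["abc", "xyz"]

# Stand-in for the Aho-Corasick matcher (assumption: it yields the
# dictionary words found anywhere in the text, even inside other words).
def candidate_matches(text)
  BAD_WORDS.select { |word| text.downcase.include?(word) }
end

# Keep only candidates that occur at least once as a whole word, i.e.
# not immediately preceded or followed by a letter or digit.
def whole_word_matches(text, candidates)
  candidates.uniq.select do |word|
    pattern = /(?<![[:alnum:]])#{Regexp.escape(word)}(?![[:alnum:]])/i
    text =~ pattern
  end
end

text = "habcd is fine, but abc-related text is not"
flagged = whole_word_matches(text, candidate_matches(text))
puts flagged.inspect
```

Because the regular expression is only run for words the matcher already reported, the extra cost stays proportional to the number of candidate matches rather than to the size of the word list.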

Petar Atanasov answered Jun 04 '26 00:06



