I have a huge set of keywords. Given a text, I want to recognize only the words that occur in that keyword list and ignore all other words. What is the best way to approach this?
The Aho-Corasick algorithm is a fast algorithm for recognizing a set of pattern strings in a larger source string. It's employed by several search utilities, along with many antivirus programs, since it runs in time O(m + n + z), where n is the total size of all the pattern strings you're trying to match, m is the length of the string to search, and z is the total number of matches. Moreover, if you know in advance what strings you're searching for, you can do the O(n) work offline and reduce the search time to O(m + z).
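To make the idea concrete, here is a minimal sketch of an Aho-Corasick matcher in plain Python. The function names `build_automaton` and `find_keywords` are made up for this illustration, and the code assumes non-empty keywords and favors readability over speed. Building the automaton is the offline step; scanning the text is the search step.

```python
from collections import deque

def build_automaton(keywords):
    """Build the trie, failure links, and output lists (the offline step)."""
    goto = [{}]   # goto[s] maps a character to the next state; state 0 is the root
    fail = [0]    # fail[s] is the state for the longest proper suffix that is also in the trie
    out = [[]]    # out[s] lists the keywords that end at state s
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Breadth-first traversal so that shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def find_keywords(text, automaton):
    """Scan text once and return (start_index, keyword) for every match."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_keywords("ushers", automaton))
```

Note that this reports every occurrence, including keywords found inside longer words ("he" inside "ushers"). If you only want whole-word hits, one option is to filter the matches by checking that the characters on either side of each match are not letters. Copying the output lists along the failure links keeps the search loop simple, but it can use more memory than a version that follows the links at search time.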