I need to analyze the text to exist in it banned words. Suppose the black list is the word: "Forbid". The word has many forms. In the text the word can be, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I use a process lemmatization. Your suggestions?
What about typos?
For example: "F0rb1d". I think use damerau–Levenshtein or another. You suggestions?
And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR
"F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)
How to solve this problem?
Preferably fast algorithm, because text are processed in real time.
And maybe what some tips to improve performance (how to store, etc)?
there're two possible solutions as far as I know algorithms.
You could try to use dynamic programming , LCS (longest common subsequence). It will search original text for the desired word as pattern, I believe it's O(mn):
http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html
Although the easier would be to use text search algorithm. The best I know is KMP and it's O(n). For character comparison you could group them into sets like {i I l(L) 1}, {o O 0} and so on. Yet you could modify this for not matching all letters (forbid -> forbad).
http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm
So now you could compare benefits of these two and yours suggestion.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With