Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Analyze text (lemmatization, edit distance)

I need to analyze the text to exist in it banned words. Suppose the black list is the word: "Forbid". The word has many forms. In the text the word can be, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I use a process lemmatization. Your suggestions?

What about typos?
For example: "F0rb1d". I think use damerau–Levenshtein or another. You suggestions?

And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)

How to solve this problem?
Preferably fast algorithm, because text are processed in real time.
And maybe what some tips to improve performance (how to store, etc)?

like image 366
user348173 Avatar asked Apr 03 '11 12:04

user348173


1 Answers

there're two possible solutions as far as I know algorithms.

You could try to use dynamic programming , LCS (longest common subsequence). It will search original text for the desired word as pattern, I believe it's O(mn):

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html

Although the easier would be to use text search algorithm. The best I know is KMP and it's O(n). For character comparison you could group them into sets like {i I l(L) 1}, {o O 0} and so on. Yet you could modify this for not matching all letters (forbid -> forbad).

http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

So now you could compare benefits of these two and yours suggestion.

like image 103
Aurimas Neverauskas Avatar answered Sep 21 '22 07:09

Aurimas Neverauskas