Analyze text (lemmatization, edit distance)

Question

I need to analyze the text to exist in it banned words. Suppose the black list is the word: "Forbid". The word has many forms. In the text the word can be, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I use a process lemmatization. Your suggestions?

What about typos?
For example: "F0rb1d". I think use damerau–Levenshtein or another. You suggestions?

And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)

How to solve this problem?
Preferably fast algorithm, because text are processed in real time.
And maybe what some tips to improve performance (how to store, etc)?

Aurimas Neverauskas · Accepted Answer

there're two possible solutions as far as I know algorithms.

You could try to use dynamic programming , LCS (longest common subsequence). It will search original text for the desired word as pattern, I believe it's O(mn):

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html

Although the easier would be to use text search algorithm. The best I know is KMP and it's O(n). For character comparison you could group them into sets like {i I l(L) 1}, {o O 0} and so on. Yet you could modify this for not matching all letters (forbid -> forbad).

http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

So now you could compare benefits of these two and yours suggestion.

Analyze text (lemmatization, edit distance)

Tags:

c#

nlp

similarity

lemmatization

user348173

1 Answers

Aurimas Neverauskas

Recent Activity

Donate For Us

Analyze text (lemmatization, edit distance)

Tags:

c#

nlp

similarity

lemmatization

user348173

1 Answers

Aurimas Neverauskas

Related questions

Recent Activity

Donate For Us