
Algorithm for Comparing Words (Not Alphabetically)

I need to implement a solution for a specific requirement, and I'd like to know whether anyone is familiar with an off-the-shelf library that handles it, or can point me to best practice. Description:

The user inputs a word that is supposed to be one of several fixed options (I hold the options in a list). I know the input is meant to be a member of the list, but since it is user input, they may have made a mistake. I'm looking for an algorithm that tells me which word the user most probably meant. I don't have any context, and I can't force the user to choose from a list (i.e. they must be able to type the word freely).

For example, say the list contains the words "water", "quarter", "beer", "beet", "hell", "hello" and "aardvark".

The solution must account for different types of "normal" errors:

  • Speed typos (e.g. doubling characters, dropping characters, etc.)
  • Keyboard adjacent-character typos (e.g. "qater" for "water")
  • Non-native English typos (e.g. "quater" for "quarter")
  • And so on...

The obvious solution is to compare letter by letter and assign "penalty weights" to each different letter, extra letter and missing letter. But this solution ignores the thousands of "standard" errors that I'm sure are catalogued somewhere. I'm sure there are heuristics out there that deal with all the cases, both specific and general, probably using a large database of standard mismatches (I'm open to data-heavy solutions).

I'm coding in Python but I consider this question language-agnostic.

Any recommendations/thoughts?

Roee Adler asked May 19 '09 16:05


3 Answers

You'll want to read how Google does this: http://norvig.com/spell-correct.html

Edit: Some people have mentioned algorithms that define a metric between a user-given word and a candidate word (Levenshtein, Soundex). However, this is not a complete solution to the problem, since you also need a data structure that can efficiently perform a non-Euclidean nearest-neighbour search. This can be done, for example, with a Cover Tree: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
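
For a short fixed list, a nearest-neighbour structure is probably more than is needed. The following is only a rough sketch in the spirit of the linked essay, not code from it: it generates every string one edit away from the input and keeps those that appear in the list, falling back to two edits. The names OPTIONS, edits1 and candidates are illustrative, and a lowercase ASCII alphabet is assumed.

```python
import string

# The fixed options from the question (illustrative).
OPTIONS = {"water", "quarter", "beer", "beet", "hell", "hello", "aardvark"}


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one simple edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, options=OPTIONS):
    """Return the options reachable from `word` in 0, 1 or 2 edits,
    preferring the smallest number of edits."""
    word = word.lower()
    if word in options:
        return {word}
    one_edit = edits1(word)
    near = one_edit & options
    if near:
        return near
    # Assumption: two edits is a reasonable cut-off for short words.
    two_edits = {e2 for e1 in one_edit for e2 in edits1(e1)}
    return two_edits & options


print(candidates("qater"))   # a keyboard-adjacent typo for "water"
print(candidates("quater"))  # a dropped letter in "quarter"
```

When several options are equally close (for instance "helo" is one edit from both "hell" and "hello"), this sketch returns all of them; some extra signal, such as how often each option is chosen, would be needed to rank them.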

bayer answered Oct 14 '22 05:10


A common solution is to calculate the Levenshtein distance between the input and each of your fixed texts. The Levenshtein distance between two strings is the minimum number of simple operations - insertions, deletions, and substitutions of a single character - required to turn one string into the other.
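
As a minimal sketch of that definition (an illustration, not a tuned implementation), the distance can be computed with the usual dynamic-programming table, keeping only two rows at a time. The helper closest and the OPTIONS list are hypothetical names; every operation is assumed to cost 1.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest(word, options):
    """Return the option with the smallest distance to `word`.
    On a tie, the option that comes first in `options` wins."""
    word = word.lower()
    return min(options, key=lambda option: levenshtein(word, option))


OPTIONS = ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]
print(closest("quater", OPTIONS))
print(closest("aardvrk", OPTIONS))
```

With uniform costs, "qater" is exactly as far from "water" as "xater" would be. Lowering the substitution cost for keys that sit next to each other on the keyboard is one way to model the adjacent-key typos from the question, though the weights themselves would have to be chosen or learned.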

Daniel Brückner answered Oct 14 '22 05:10


Have you considered algorithms that compare words by how they sound, such as Soundex? It shouldn't be too hard to compute Soundex codes for your list of words, store them, and then compute the Soundex code of the user input and look for the closest match among them.
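
A rough sketch of that idea, assuming the standard American Soundex rules (first letter kept, consonants mapped to digits, code padded to four characters). As a simplification, "closest match" here means an identical code; phonetic_matches and OPTIONS are illustrative names.

```python
# Consonant groups and their Soundex digits; vowels, h, w and y get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character American Soundex code of `word`."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "hw":
            # h and w do not separate equal codes; vowels do.
            previous = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def phonetic_matches(word, options):
    """Return the options whose Soundex code equals that of `word`."""
    target = soundex(word)
    return [option for option in options if soundex(option) == target]


OPTIONS = ["water", "quarter", "beer", "beet", "hell", "hello", "aardvark"]
print(phonetic_matches("watter", OPTIONS))
print(phonetic_matches("quater", OPTIONS))
```

This catches doubled letters and vowel mix-ups, but it has clear limits on the question's examples: "quater" codes to Q360 while "quarter" codes to Q636, so they do not match, and because the first letter is kept verbatim, "qater" can never match "water". It is probably best combined with an edit-distance check rather than used alone.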

workmad3 answered Oct 14 '22 06:10