Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm in .NET to detect simple typos

Is there an existing algorithm for .NET that would be able to detects typos from a list of predefined words?

Example, suppose the word "Stuff" is in my list and someone types "Stuf", "sutff", or "stff" or "stiff". I would like to be able to suggest the person that the word "Stuff" might be the right word.

I am not talking about anything grammatical or anything more than maybe one letter missing, substituted or mixed.

My goal is to prevent the same word being entered in a list with different typing. Not that uppercase and lowercase do not cause a problem for me as everything is always lowercase.

like image 356
David Brunelle Avatar asked Dec 09 '22 06:12

David Brunelle


2 Answers

This is a well-studied problem and there are many good algorithms for doing this. Most of them work by constructing some sort of data structure to hold all of the legal words in a way where words with similar edit distance can be found efficiently. Informally, the edit distance between two strings is the number of changes you would need to make to transform one string into another. For example, given the words "misspell" and "mispell," the edit distance is one (just insert another 's' into the word), while the edit distance between "cat" and "dog" is three (replace each letter).

A misspelled word is likely only a small edit distance away from the word that was intended, so if you can store the words in a way where you can, for any string, query words that are a small edit distance away from the string, you can provide a candidate list of possible words that the user may have meant.

One common data structure for holding this data is a trie, a 26-way tree structure where each node stores a prefix of a word and each edge corresponds to appending some letter to the current prefix. If you have such a structure, you can find words that are "close" to a particular word (perhaps a certain edit distance away) using a simple recursive tree search. At each point, keep track of how far of an edit distance you wish to be away from the target word and how much of the misspelled word that you have processed so far. At each point, you can either follow the edge in the trie corresponding to the letter in the word, or you can use up one edit distance to insert a new letter by following a different edge in the trie.

Another data structure that is often used for this is a BK-tree, which stores strings in a way where you can efficiently query all words a certain edit distance away from some source string. This would more directly solve your problem, though there are fewer online resources on how to construct BK-trees compared with tries.

Once you've found the words that are within some edit distance, you'll probably want to rank them somehow before presenting them to the user. This requires you to have some idea of what sorts of typos people tend to make in practice. Common typos include

  • Transpositions, where two letters are swapped: "thme" instead of "them"
  • Substitutions, where the wrong letter is used: "arithmatic" instead of "arithmetic"
  • Deletions, where a letter is left out: "helo" instead of "hello"
  • Repetition, where a letter is duplicated: "tommorrow" instead of "tomorrow"

To build a good spell checker, you would ideally have some sort of probabilities associated with each type of mistake. That way, when you had your list of possible corrections, you could rank them from most probable to least probable.

Hope this helps!

like image 107
templatetypedef Avatar answered Dec 23 '22 13:12

templatetypedef


Here's a good step by step to do what you are looking for implemented in python, but there are links to implementations in C# and other languages too.

http://norvig.com/spell-correct.html

like image 38
Jacob Eggers Avatar answered Dec 23 '22 12:12

Jacob Eggers