Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do spell checkers work?

Tags:

I need to implement a spell checker in C. Basically, I need all the standard operations... I need to be able to spell check a block of text, make word suggestions and dynamically add new words to the index.

I'd kind of like to write this myself, tho I really don't know where to begin.

like image 451
dicroce Avatar asked Dec 06 '08 20:12

dicroce


People also ask

Are spell checkers reliable?

An ordinary spell checker will find few or no errors in the above sentence. This is because spell checkers can only detect if words are spelled correctly, not if they are used correctly. That being said, a spell checker is a handy tool and, therefore, should not be completely abandoned.

What are the three things spell checkers do not recognize?

Spell Checkers do not always recognize proper nouns, which are names of people, places, and terms. Many will try to either correct them into something else, or just flag them as errors.

What are the advantages of spell checkers?

Spell check lets you know when words are misspelled, corrects misspelled words as you type, and allows you to search a whole document for misspelled words.

What is spell checker in word?

In Microsoft Word documents, Word's spell check function is set to automatically check your spelling while you type. Errors in your document will have color-coded underlines reflecting your choices, like red for spelling errors, green for grammar errors, and blue for contextual spelling errors.


1 Answers

Read up on Tree Traversal. The basic concept is as follows:

  1. Read a dictionary file into memory (this file contains the entire list of correctly spelled words that are possible/common for a given language). You can download free dictionary files online, such as Oracle's example dictionary.
  2. Parse this dictionary file into a search tree to make the actual text search as efficient as possible. I won't describe all of the dirty details of this type of tree structure, but the tree will be made up of nodes which have (up to) 26 links to child nodes (one for each letter), plus a flag to indicate wether or not the current node is the end of a valid word.
  3. Loop through all of the words in your document, and check each one against the search tree. If you reach a node in the tree where the next letter in the word is not a valid child of the current node, the word is not in the dictionary. Also, if you reach the end of your word, and the "valid end of word" flag is not set on that node, the word is not in the dictionary.
  4. If a word is not found in the dictionary, inform the user. At this stage, you can also suggest alternate spellings, but that gets a tad more complicated. You will have to loop through each character in the word, substituting alternate characters and test each of them against the search tree. There are probably more efficient algorithms for finding the recommended words, but I don't know what they are.

A really short example:

Dictionary:

apex apple appoint appointed

Tree: (* indicates valid end of word) update: Thank you to Curt Sampson for pointing out that this data structure is called a Patricia Tree

A -> P -> E -> X*
      \\-> P -> L -> E*
           \\-> O -> I -> N -> T* -> E -> D*

Document:

apple appint ape

Results:

  • "apple" will be found in the tree, so it is considered correct.
  • "appint" will be flagged as incorrect. Traversing the tree, you will follow A -> P -> P, but the second P does not have an I child node, so the search fails.
  • "ape" will also fail, since the E node in A -> P -> E does not have the "valid end of word" flag set.

edit: For more details on spelling suggestions, look into Levenshtein Distance, which measures the smallest number of changes that must be made to convert one string into another. The best suggestions would be the dictionary words with the smallest Levenshtein Distance to the incorrectly spelled word.

like image 103
e.James Avatar answered Sep 20 '22 19:09

e.James