Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithms and Data Structures best suited for a spell checker, dictionary and a thesaurus

Best way to implement a

  • dictionary (is there any DS better than Trie for Dictionary)
  • thesaurus (no idea, as match is made on meanings of the words, similar meanings)
  • spell checker (something better than a hash map), if possible with correct spelling recommendations.

When asked in a one hour interview, are we expected to write a c/c++ code, for the algorithm?

like image 212
Vivek Sharma Avatar asked Oct 06 '09 08:10

Vivek Sharma


People also ask

Which data structure is used for dictionary and spell checker?

The simplest data structure that is used for spell and dictionary cheking is Hashing.

What algorithm does spell check use?

Spell checkers can use approximate string matching algorithms such as Levenshtein distance to find correct spellings of misspelled words. An alternative type of spell checker uses solely statistical information, such as n-grams, to recognize errors instead of correctly-spelled words.

Which data structure is best for dictionary?

If we want both operations, look up and prefix search, Trie is suited. With Trie, we can support all operations like insert, search, delete in O(n) time where n is length of the word to be processed. Another advantage of Trie is, we can print all words in alphabetical order which is not possible with hashing.

What type of data structure is a dictionary?

A dictionary is an unordered data structure with elements separated by a comma and stored as a key-value pair. It is enclosed within curly brackets. Also, the key-value pair is separated by a : (colon).


2 Answers

I can see no better data structure than a trie for the dictionary and the thesaurus. Both can be fitted in one structure if needed with one link in the node pointing to the meaning of the word (dictionary) and one to synonyms (thesaurus). It can even offer some form of autocompletion (when there's only one link in the node).

Spelling corrector is a bit trickier - since one has to map fals input to some kind of correct input. You can take this link as a start: http://en.wikipedia.org/wiki/Spell_checker. At the end you'll find links to papers about different algorithms. According to the wikipedia article, this paper describes the most successfull algorithm: Andrew Golding and Dan Roth's "Winnow-based spelling correction algorithm"

like image 193
Tobias Langner Avatar answered Nov 15 '22 13:11

Tobias Langner


See this for a 21-line Python 2.5 spelling corrector and a bit of background.

like image 30
Anton Gogolev Avatar answered Nov 15 '22 13:11

Anton Gogolev