I want to implement the following in C++:
1) Check whether the given word exists in a dictionary. The dictionary file is a huge file; consider 100MB or 3-4 Million words.
2) Suggest corrections for the incorrect word.
3) Autocomplete feature.
1) I am planning to build a tree so searching will efficient.
2) I am not getting how to implement auto correction feature.
3) I can implement auto complete feature using trees
What's the best data structure and algorithm to implement all the above features?
The basic algorithm behind autocorrection software like T9 is pretty simple. The system is essentially the same as a word processor's spell checker—as you type, the software checks each word against a built-in dictionary, and it suggests alternatives when it doesn't find a match.
Auto-correct is a feature available on Android and Apple smartphones, PCs and applications like Outlook. Its goal is to make typing a little easier by automatically changing words, letters or punctuation it thinks you entered incorrectly.
Inside The AI: how we build our autocorrect, emoji and text prediction engine. Although it's Typewise's unique hexagonal keyboard that tends to catch people's eye at first, it's our advanced autocorrect and text prediction AI under the hood that provides the power behind our product.
I have been working on the same problem. So far the best solution I have come across is using a ternary search tree for auto completion. Ternary Search Trees are more space efficient than tries. If im unable to find the looked up string in my ternary search tree then I use an already built BK Tree for finding the closest match. BK Tree internally uses Levenshtein distance. You
Metaphones are also something you might want to explore however I havent gone into the depth of metaphones.
I have a solution in Java for BK TREE + TERNARY SEARCH TREE if you like.
You can do autocomplete by looking at all the strings in a given subtree. Some score to help you pick might help. This works something like if you have "te" you go down that path in the trie and the traverse the entire subtree there to get all the possible endings.
For corrections you need to implement something like http://en.wikipedia.org/wiki/Levenshtein_distance over the tree. You can use the fact that if you processed a given path in the trie, you can reuse the result for all the strings in the subtree rooted at the end of your path.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With