Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement a dictionary (Trie vs HashTable and important issues)?

I've ran across several questions and articles saying that dictionary implementation in java is done best using tries. But most of them didn't address important issues, as far as I saw it. So, next is a real world task:

Let's assume that I need to implement a dictionary (let's say something like Lingvo, but simpler) using java. For my particular task it is needed to store word definitions and to perform fast dictionary lookup.

Please, address next questions:

  • What data structure should I use then (Trie or HashTable)?
  • How should it(search, datastruct) be organised if I need the dictionary to be case insensitive?
  • What if I want it(search, dictionary) to be case sensitive?

P.S.: Code examples are highly appreciated. :)

Thanks for answers in advance.

UPDATE:If we're talking about standard DS implementations in java, is it true that HashTable will be the best one for this particular task? Why not HashMap, TreeMap or LinkedHashMap?

like image 435
Denys S. Avatar asked Jan 14 '11 12:01

Denys S.


People also ask

How can u implement a dictionary efficiently?

Approach: We can use a Trie to efficiently store strings and search them. Here, an implementation of a dictionary using Trie (memory optimization using hash-map) is discussed. We add another field to Trie node, a string which will hold the meaning of a word.

How do you choose between hash table and trie?

Therefore, if we want a full-text lookup application, the hash table is better as it has a faster lookup speed. However, if we want to return all the words that match a prefix, the trie is our solution.

When should we use dictionary or HashTable?

In Hashtable, you can store key/value pairs of the same type or of the different type. In Dictionary, you can store key/value pairs of same type. In Hashtable, there is no need to specify the type of the key and value. In Dictionary, you must specify the type of key and value.

How is trie data structure implemented?

Implementation of Trie Insertion proceeds by walking the Trie according to the string to be inserted, then appending new nodes for the suffix of the string that is not contained in the Trie.


2 Answers

I want to address just one point in your question:

A trie is not a general-purpose dictionary data structure. The reason is that a trie is a specialized search tree for (sub)string search. Generally, you will be more interested in general search trees, e.g. binary search trees or B-trees.

All these implementations rely on an ordering of the dictionary elements, and all of them have a logarithmic average-case and worst-case runtime for common operations.

A hash table, by contrast, does not require a relative ordering of the elements. Instead, it requires that elements are hashable and equality comparable. The worst-case characteristic of common hash table characteristics is much worse than for trees, namely linear in the number of elements.

However, with a bit of care the average case for hash tables operations can be made constant (i.e. independent of the container size). What’s more, it can be proven that slower operations are exceedingly rare.

In practice, this means that except for very specialized use-cases, hash tables beat tree-based dictionaries hands down.

The downside to this is that hash tables impose an arbitrary-seeming order on its elements. If you are interested in getting the items from your dictionary in sorted order, hash tables are not for you.

(There are other interesting implementations of dictionaries, e.g. skip lists which rival search trees and probabilistic implementations like the Bloom filter.)

A trie-based implementation can only be used if you are dealing with a dictionary of string values, in which case it is actually often a good choice, in particular if many strings in the dictionary share common prefixes and are rather short.

like image 121
Konrad Rudolph Avatar answered Nov 06 '22 23:11

Konrad Rudolph


EDIT stop upvoting this: I misread the question. The OP is not after a dictionary to verify word spellings/suggestions/type-ahead-lookup/auto-completion/whatever (which I thought was what he was after). The OP is after a key/value mapping where for each word there's a definition.

Having worked on dictionaries, I can tell you that you're taking the wrong approach.

It's not as simple as a choice between an hashtable or a trie.

You mention Lingvo: it's much more than just a table.

Do you want close match to be offered suggestions? You may then need to things like generating permutations on what the user entered and for each permutation see if it exists in the dico: if it does, you'd then need to compute its' Levenhstein Edit Distance and suggest first the words that have the shortest LED.

Do you want most likely matches to be auto-completed/suggested (like what Google does)? You'd then need a very advanced data structure like a BK-tree (basically a tree of LED if I understand it correctly).

How many words will you have in your dictionary? You won't be able to use a dictionary made of 400 000 words using Strings and other heavyweight Java objects / data structure without a serious performance hit (once again: a dictionary is more than just one hashtable, a dictionary typically involve several datastructures). This won't easily fit in your users' computer memory. There are known, searchable, ways to store words where every single word can be packed on fewer than 15 bits per word (fewer than 15 bits per word, you read correctly).

In addition to that, you may want to do suggestion based on phonetics: like by using a double-metaphone mapping.

A dictionary, as in a "word dictionary" is so much more than just a key/value table. It is really a complicated beast due to which features the user shall except and due to the amount of data involved. Just plain english + a few specialized domains terminologies, medical, comp-sci, whatever. will give you hundreds of thousands of data: try to put that in a Java HashMap and... Kaboom!

like image 27
Gugussee Avatar answered Nov 07 '22 01:11

Gugussee