Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find the longest word given a collection

It is a google interview question and I find most answers online using HashMap or similar data structure. I am trying to find a solution using Trie if possible. Anybody could give me some hints?

Here is the question: You are given a dictionary, in the form of a file that contains one word per line. E.g.,

abacus  deltoid  gaff  giraffe  microphone  reef  qar  

You are also given a collection of letters. E.g.,

{a, e, f, f, g, i, r, q}.  

The task is to find the longest word in the dictionary that can be spelled with the collection of letters. For example, the correct answer for the example values above is “giraffe”. (Note that “reef” is not a possible answer, because the set of letters contains only one “e”.)

Java implementation would be preferred.

like image 744
King Saber Avatar asked Jun 01 '13 04:06

King Saber


1 Answers

No Java code. You can figure that out for yourself.

Assuming that we need to do this lots of times, here's what I'd do:

  • I'd start by creating "signatures" for each word in the dictionary consisting of 26 bits, where bit[letter] is set iff the word contains one (or more) instances of letter. These signatures can be encoded as a Java int.

  • Then create a mapping that maps signatures to lists of words with that signature.

To do a search using the precomputed map:

  • Create the signature for the set of letters you want to find the words for.

  • Then iterate over the keys of the mapping looking for keys where (key & (~signature) == 0). That gives you a short list of "possibles" that don't contain any letter that is not in the required letter set.

  • Iterate over the short list looking for words with the right number of each of the required letters, recording the longest hit.


Notes:

  1. While the primary search is roughly O(N) on the number of words in the dictionary, the test is extremely cheap.

  2. This approach has the advantage of requiring a relatively small in-memory data structure, that (most likely) has good locality. That is likely to be conducive to faster searches.


Here's an idea for speeding up the O(N) search step above.

Starting with the signature map above, create (precompute) derivative maps for all words that do contain specific pairs letters; i.e. one for words containing AB, for AC, BC, ... and for YZ. Then if you are looking for words containing (say) P and Q, you can just scan the PQ derivative map. That will reduce O(N) step by roughly 26^2 ... at the cost of more memory for the extra maps.

That can be extended to 3 or more letters, but the downside is the explosion in memory usage.

Another potential tweak is to (somehow) bias the selection of the initial letter pair towards letters/pairs that occur infrequently. But that adds an up-front overhead which could be greater than the (average) saving you get from searching a shorter list.

like image 66
Stephen C Avatar answered Oct 13 '22 17:10

Stephen C