Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to generate all possible letter combinations of given string down to 2 letters

Algorithm to generate all possible letter combinations of given string down to 2 letters

Trying to create an Anagram solver in AS3, such as this one found here:

http://homepage.ntlworld.com/adam.bozon/anagramsolver.htm

I'm having a problem wrapping my brain around generating all possible letter combinations for the various lengths of strings. If I was only generating permutations for a fixed length, it wouldn't be such a problem for me... but I'm looking to reduce the length of the string and obtain all the possible permutations from the original set of letters for a string with a max length smaller than the original string. For example, say I want a string length of 2, yet I have a 3 letter string of “abc”, the output would be: ab ac ba bc ca cb.

Ideally the algorithm would produce a complete list of possible combinations starting with the original string length, down to the smallest string length of 2. I have a feeling there is probably a small recursive algorithm to do this, but can't wrap my brain around it. I'm working in AS3.

Thanks!

like image 483
Alan Avatar asked Mar 13 '10 18:03

Alan


1 Answers

For the purpose of writing an anagram solver the kind of which you linked, the algorithm that you are requesting is not necessary. It is also VERY expensive.

Let's look at a 6-letter word like MONKEY, for example. All 6 letters of the word are different, so you would create:

  • 6*5*4*3*2*1 different 6-letter words
  • 6*5*4*3*2 different 5-letter words
  • 6*5*4*3 different 4-letter words
  • 6*5*4 different 3-letter words
  • 6*5 different 2-letter words
  • For a total of 1950 words

Now, presumably you're not trying to spit out all 1950 words (e.g. 'OEYKMN') as anagrams (which they are, but most of them are also gibberish). I'm guessing you have a dictionary of legal English words, and you just want to check if any of those words are anagrams of the query word, with the option of not using all letters.

If that is the case, then the problem is simple.

To determine if 2 words are anagrams of each other, all you need to do is count how many times each letters are used, and compare these numbers!

Let's restrict ourself to only 26 letters A-Z, case insensitive. What you need to do is write a function countLetters that takes a word and returns an array of 26 numbers. The first number in the array corresponds to the count of the letter A in the word, second number corresponds to count of B, etc.

Then, two words W1 and W2 are exact anagram if countLetters(W1)[i] == countLetters(W2)[i] for every i! That is, each word uses each letter the exact same number of times!

For what I'd call sub-anagrams (MONEY is a sub-anagram of MONKEY), W1 is a sub-anagram of W2 if countLetters(W1)[i] <= countLetters(W2)[i] for every i! That is, the sub-anagram may use less of certain letters, but not more!

(note: MONKEY is also a sub-anagram of MONKEY).


This should give you a fast enough algorithm, where given a query string, all you need to do is read through the dictionary once, comparing the letter count array of each word against the letter count array of the query word. You can do some minor optimizations, but this should be good enough.

Alternatively, if you want utmost performance, you can preprocess the dictionary (which is known in advance) and create a directed acyclic graph of sub-anagram relationship.

Here's a portion of such a graph for illustration:

 D=1,G=1,O=1  ----------> D=1,O=1
  {dog,god}   \            {do,od}
               \
                \-------> G=1,O=1
                           {go}

Basically each node is a bucket for all words that have the same letter count array (i.e. they're exact anagrams). Then there's a node from N1 to N2 if N2's array is <= (as defined above) N1's array (you can perform transitive reduction to store the least amount of edges).

Then to list all sub-anagrams of a word, all you have to do is find the node corresponding to its letter count array, and recursively explore all nodes reachable from that node. All their buckets would contain the sub-anagrams.

like image 97
polygenelubricants Avatar answered Oct 05 '22 06:10

polygenelubricants