Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How Can I Speed Up This Anagram Algorithm

I am making a mobile app to find anagrams and partial matches. Mobile is important because there is not a whole lot of computational power, and efficiency is key.

The algorithm takes any number of letters, including repeats, and finds the longest words made up from its letters, using every letter only once. I am also interested in finding the top results quickly, and am not really concerned with the bottoms (shorter ones) as long as N is met. For example:

STACK => stack, tacks, acts, cask, cast, cats…

I have done some googling and have found a few algorithms, and I came up with one which I thought would be efficient, but is not as efficient as I would like.

I have a lookup dictionary pre-made that maps a sorted key to the real words that generate that key.

"aelpp" => ["apple", "appel", "pepla"]

I have further split each dictionary into different ones based on the length of the key. So keys that are 5 letters long are in one dictionary, keys that are 6 are in another. Each of these dictionaries are in an array in which the index is the length of the keys that are found in the dictionary.

anagramArray[5] => dictionary5
dictionary5["aelpp"] => ["apple", "appel", "pepla"]

My algorithm starts by taking an input word "lappe", and it sorts it:

"lappe" => "aelpp"

Now, for each dictionary that has contents of at most 5 letters, I do a comparison to pull it out. Here is pseudocode:

word = input.sort
for (i = word.length; i > 0; i--)
    dictionaryN = array[i]
    for (key in dictionaryN)
        if word matches key
            add to returnArray
        end
    end
    if returnArray count > N
      break
    end
end

returnArray.sort by longest word, alphabetize

The dictionary only has about 170,000 words in it, but searches are taking up to 20 seconds for 12 letter inputs. My match method makes a regex out of the key:

"ackst" => /a.*c.*k.*s.*t.*/

such that, for example, a 4 letter key such as acst (acts), will match ackst (stack) because:

"ackst" matches /a.*c.*s.*t.*/

I have seen other apps do the same thing in much less time, and I am wondering if my approach is garbage or just needs some tweaking.

How can I get the maximum computational efficiency for generating the top N anagrams from a word, sorted by max length?

like image 641
coneybeare Avatar asked Jul 02 '11 04:07

coneybeare


People also ask

What is anagram algorithm?

The anagram algorithm is a simple algorithm. Create a function where you compare two strings and check if they are anagrams of each other. The strings can contain any type of characters like “Hi, there!” and “There… hI!


1 Answers

If you think of (and perhaps even represent) a dictionary as a tree of letters you can avoid looking at lots of nodes. If "stack" is in the dictionary, then there will be a path from the root to a leaf labelled a-c-k-s-t. If the input word is "attacks" then sort this to get aackstt. You can write a recursive routine to follow links down from the root, consuming letters from aackstt as you go. When you reach a-c-k you will have stt left in your string, so you can follow the s to reach ackst, but you can rule out following u to reach a-c-k-u and its descendants, v to reach a-c-k-v and its descendants, and so on.

In fact, with this scheme, you could use just one tree to hold words of any number of letters, which should save you doing multiple searches, one for each target length.

like image 196
mcdowella Avatar answered Sep 18 '22 22:09

mcdowella