
How to find words from scrambled words

I'm trying to find specific dictionary words that appear consecutively in scrambled text. Characters that aren't part of any found word are replaced with an X.

For example, say the dictionary words are:

jane
john
brownbag
foo
youth

and scrambled text:

ofozlhuoyt => fooXXyouth
yuawbnrobgajen => XXbrownbagjane
janjeohn => (nothing since jane and john aren't consecutive)

Approach I'm trying:

Say I have a hash with keys a through z and a set as the value for each key. Each number in the set is the (0-based) index of a dictionary word that contains that character.

From above example:

{a: [0,2]}
{b: [2]}
{c: []}
{e: [0]}
{f: [3]}
{g: [2]}
{h: [1,4]}
{j: [0,1]}
...
{n: [0,1,2]}
{o: [1,2,3,4]}
{r: [2]}
{u: [4]}
{t: [4]}
{w: [2]}
{y: [4]}
...
{z: []} 
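A minimal sketch of building that lookup in Python, assuming the five example words and 0-based word indices. `build_letter_index` is a hypothetical helper name used only for illustration.

```python
from collections import defaultdict

def build_letter_index(words):
    """Map each letter to the set of indices of the words containing it."""
    index = defaultdict(set)
    for i, word in enumerate(words):
        for letter in word:
            index[letter].add(i)
    return index

words = ["jane", "john", "brownbag", "foo", "youth"]
letter_index = build_letter_index(words)
print(sorted(letter_index["o"]))  # [1, 2, 3, 4]
```

Letters that occur in no word (such as c or z) simply have no entry until they are looked up.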

After preparing the above, we can begin looking at each character of the scrambled text:

First String: ofozlhuoyt

  1. o => exists in words 1, 2, 3, and 4

  2. start with 1: john (length 4)

  3. get 4 chars: ofoz

  4. "john".sort(false) == "ofoz".sort(false)?

  5. if false: repeat steps 2 to 4 for the next index, 2 (brownbag), and so on

  6. if true: add the word to the list of good words and go back to step 1 with the next unused character (after foo matches, that is z)
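The steps above can be sketched in Python as a greedy scan. This is only one reading of the approach: it assumes the first candidate word whose sorted letters match the window is accepted, and `greedy_scan` is a hypothetical name.

```python
def greedy_scan(text, words):
    """Greedy left-to-right scan; unmatched characters become 'X'."""
    # letter -> indices of the dictionary words containing that letter
    index = {}
    for i, word in enumerate(words):
        for letter in set(word):
            index.setdefault(letter, []).append(i)

    sorted_words = [sorted(word) for word in words]  # sort each word once
    out = []
    pos = 0
    while pos < len(text):
        for i in index.get(text[pos], []):
            window = text[pos:pos + len(words[i])]
            if sorted(window) == sorted_words[i]:
                out.append(words[i])
                pos += len(words[i])
                break
        else:  # no candidate word matched at this position
            out.append("X")
            pos += 1
    return "".join(out)

words = ["jane", "john", "brownbag", "foo", "youth"]
print(greedy_scan("ofozlhuoyt", words))      # fooXXyouth
print(greedy_scan("yuawbnrobgajen", words))  # XXbrownbagjane
```

Because the scan is greedy, it can miss a better match that starts one character later; it reproduces the first two examples but is not guaranteed to be optimal in general.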

Is there a better way of doing this? I feel a better data structure exists for a problem like this, but I can't figure out which one to use.

Anthony asked Nov 12 '13 23:11




2 Answers

You can use prime numbers!

By the fundamental theorem of arithmetic, the product of any collection of primes is unique: no other combination of primes (counting repeats) gives the same product.

In your problem, the key is that order does not matter so sorting will be a waste of time. In other words,

'jane' == 'ejna' == 'jnea' == ...

So, you can create your own hash function based on this property of primes, and use the commutativity of multiplication to avoid sorting and string searches altogether. And in Python, you don't even have to worry about the size of ints; that will come in handy if your dictionary has really long words.

Below is a simple dict mapping letters to the first 26 primes, and the accompanying hash function.

letters_to_primes = {'a': 2, 'b': 3, 'c': 5, 'd': 7, ... 'x': 89, 'y': 97, 'z': 101}

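def my_prime_hash(word):
    """Return the product of the primes assigned to the letters of word."""
    product = 1  # named product rather than sum, which shadows the builtin
    for letter in word:
        product *= letters_to_primes[letter]  # Multiplication is commutative!
    return product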

Again, the key property we're exploiting here is that

my_prime_hash('jane') == my_prime_hash('enaj') == ... == 27434
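For reference, a self-contained sketch that generates the letter-to-prime table instead of typing it out, assuming lowercase a to z mapped to the first 26 primes in order. `first_primes`, `LETTER_PRIMES`, and `prime_hash` are hypothetical names chosen for this example.

```python
import string

def first_primes(count):
    """Return the first `count` primes by trial division (fine for 26)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# 'a' -> 2, 'b' -> 3, ..., 'z' -> 101
LETTER_PRIMES = dict(zip(string.ascii_lowercase, first_primes(26)))

def prime_hash(word):
    """Order-independent hash: product of the primes of the letters."""
    product = 1
    for letter in word:
        product *= LETTER_PRIMES[letter]
    return product

print(prime_hash("jane"), prime_hash("enaj"))  # 27434 27434
```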

Now we simply need to create our dict of the given dictionary words. I propose an external-chaining hash-table. Let us call it 'hashed_words'.

# Given these words
words = ['jane', 'john', 'brownbag', 'foo', 'youth', 'nib', 'bin']

# Compute the hash table: product -> list of words with that product
hashed_words = {}
for word in words:
    w_hash = my_prime_hash(word)
    # setdefault creates the bucket on first use, then we append to it
    hashed_words.setdefault(w_hash, []).append(word)

After running it, hashed_words looks like:

{1113571: ['john'], 27434: ['jane'], 
 28717: ['foo'], 448956643: ['youth'], 
 3131090838L: ['brownbag'], 2967: ['nib', 'bin']}

which is what we want.

Now you can scan the scrambled text: multiply the primes of consecutive letters and check at every step whether the running product is a key in hashed_words. A state machine like the one others have proposed is still necessary for overlapping candidates such as 'mart' and 'smart' in the scrambled text 'mrtasgth' (see comments below).
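One way to sketch that scan is below. It is not the answer's own code: it assumes lowercase input, and instead of a state machine it uses a small dynamic program that covers as many characters as possible with matched windows. All names (`unscramble`, `LETTER_PRIMES`, and so on) are hypothetical.

```python
import string

def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

LETTER_PRIMES = dict(zip(string.ascii_lowercase, first_primes(26)))

def prime_hash(word):
    product = 1
    for letter in word:
        product *= LETTER_PRIMES[letter]
    return product

def unscramble(text, words):
    """Cover as much of `text` as possible with anagram windows of `words`."""
    table = {}
    for word in words:
        table.setdefault(prime_hash(word), word)  # keep first word per hash
    max_len = max(map(len, words))
    n = len(text)
    # best[i] = (number of covered characters, output string) for text[i:]
    best = [(0, "")] * (n + 1)
    for i in range(n - 1, -1, -1):
        covered, out = best[i + 1]
        choice = (covered, "X" + out)  # option: leave text[i] unmatched
        product = 1
        for j in range(i, min(n, i + max_len)):
            product *= LETTER_PRIMES[text[j]]  # running product of text[i..j]
            word = table.get(product)
            if word is not None:
                covered, out = best[j + 1]
                if covered + len(word) > choice[0]:
                    choice = (covered + len(word), word + out)
        best[i] = choice
    return best[0][1]

words = ["jane", "john", "brownbag", "foo", "youth"]
print(unscramble("ofozlhuoyt", words))      # fooXXyouth
print(unscramble("yuawbnrobgajen", words))  # XXbrownbagjane
```

Considering every start position handles overlapping candidates such as 'mart' and 'smart'. Note that under this window-based reading, `janjeohn` contains the window `anje`, which is an anagram of jane, so this sketch returns `XjaneXXX` for it; the question's third example appears to assume a stricter rule that is not spelled out.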

Note: Instead of assigning prime numbers in ascending order, you could consider the frequency distribution of all the letters occurring in your dictionary and assign the lowest prime number to the letter with the highest frequency. This keeps the products smaller, which saves memory when creating your 'hashed_words' hash table.

slider answered Oct 22 '22 09:10



There is a potentially faster way, provided you have enough memory to implement it.

First, generate all the permutations for each word. So for "jane" you would have:

aejn
aenj
ajen
ajne
anej
anje
etc.

Then, build a state machine for the Aho-Corasick algorithm, with each of the permutations for a single word going to the same ending state. That ending state would output the string you're looking for.

Now run the text through the state machine. The output would be the words that are found, and their positions. You can then sort the found words by position and determine if they appear consecutively.
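A compact sketch of this idea follows, assuming a plain dict-based trie with failure links. It is an illustration of the technique, not the implementation described in this answer, and `build_automaton` and `find_matches` are hypothetical names.

```python
from collections import deque
from itertools import permutations

def build_automaton(words):
    """Aho-Corasick automaton over every distinct permutation of each word."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        for perm in set(permutations(word)):  # set() drops duplicate orderings
            state = 0
            for ch in perm:
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            if word not in out[state]:
                out[state].append(word)  # all permutations output the word
    # Breadth-first pass to fill in failure links and inherited outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + [w for w in out[fail[nxt]] if w not in out[nxt]]
    return goto, fail, out

def find_matches(text, automaton):
    """Return sorted (start_position, word) pairs found in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((pos - len(word) + 1, word))
    return sorted(matches)

automaton = build_automaton(["jane", "john", "brownbag", "foo", "youth"])
print(find_matches("ofozlhuoyt", automaton))  # [(0, 'foo'), (5, 'youth')]
```

The sorted positions can then be checked for gaps or overlaps to decide which words appear consecutively.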

The state machine is potentially very large (on the order of n! states for each word, where n is the number of characters in the word), and will take some time to build. But once it's built, it matches very quickly. If your word list is static and you have a lot of text to search, this is the way to go, provided you have enough memory.

I used a modified Aho-Corasick algorithm that was searching text for occurrences of millions of phrases (band and song names) in video titles. The state machine occupied about 10 gigabytes of RAM and took about an hour to build, but it was fast when it came to matching.

Jim Mischel answered Oct 22 '22 07:10
