Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need help with a word-packing algorithm

I have a list of sub-lists of letters, where the number of letters in each sub-list can vary. The list and sub-lists are ordered. This structure can be used to produce words by choosing a number X, taking a letter from position X in every sub-list and concatenating them in order. If the number X is larger than the length of the sub-list, it would wrap around.

Given a set of words, I need to find a way to pack them into the smallest possible structure of this kind (i.e. with the shortest sub-lists). There would have to be as many sub-lists as the number of letter in the longest word, of course, and shorter words would be padded by blanks/spaces.

I am not a CS graduate so I apologize if the description of the problem is not entirely clear. To give a simple example: Suppose I have the words [ 'a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i '] I need to pack, I could use the following structure:

[  
    [ 'i', 'o', 'a' ],  
    [ 's', 'n', 'f', ' ' ]  
]

This would enable me to produce the following words:

0: is  
1: on  
2: af*  
3: i  
4: os*  
5: an  
6: if  
7: o *  
8: as*  
9: in  
10: of  
11: a

If you take position 10, for example, the word 'of' is generated by concatenating the letter at index 10 % 3 (= 1) from the first sub-list, with the letter at index 10 % 4 (= 2) from the second sub-list.

My best attempt so far involves using a matrix of hamming distances to place the most-"connected" words first, and then their closest neighbors, with the goal of minimizing the change with every insertion. This was an entirely intuitive attempt and I feel like there has to be a better/smarter way to solve this.

Clarification

This is a practical problem I am trying to solve and the constraints are (roughly) as follows:
1. The number of characters per sub-list should be in the area of 100 or less.
2. The keyspace should be as small as possible (i.e. the number of spurious words should be minimal). Roughly, a keyspace in the millions of options is borderline.

I don't know that a good solution is even possible for this. With the algorithm I have right now, for example, I can insert about 200 words (just random English words) in a keyspace of 1.5 million options. I'd like to do better than that.

like image 450
szx Avatar asked Aug 16 '10 23:08

szx


1 Answers

Well, you said you're interested in sub-optimal solutions, so I'll give you one. It depens on the alphabet size. For example, for 26 array size will be little over 100 (regardless of amount of words to encode).

It's well-known that if you have two different prime numbers a and b and non-negative integers k and l (k < a, l < b), you can find number n that n % a == k and n % b == l.
For example, with (a = 7, a = 13, k = 6, l = 3) you can take n = 7 * 13 + 7 * 3 + 13 * 6. n % 7 == 6 and n % 13 == 3

And same holds for any number of prime integers.

You can initialize arrays like this.

['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 29
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 31
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 37
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 41
...

Now, suppose your word is 'geek'. For it you need number X, such that X % 29 == 6, X % 31 == 4, X % 37 == 4, X % 41 == 10. And you can always find such X, as was shown above.

So, if you have alphabet of 26 letters, you can create matrix of width 149 (see the list of primes) and encode any word with it.

like image 169
Nikita Rybak Avatar answered Nov 14 '22 06:11

Nikita Rybak