Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Choosing an alphabet that covers the most words? [closed]

Tags:

algorithm

Given a list of words and an alphabet that has at most P letters, how can we choose the optimal alphabet that covers the most words?

For example: Given words "aaaaaa" "bb" "bb" with P=1, the optimal alphabet is "b" since "b" covers two words.

Another example: Given words "abmm" "abaa" "mnab" "bbcc" "mnnn" with P=4, the optimal alphabet is "abmn", since that covers 4 of the 5 words.

Are there any known algorithms, or can someone suggest an algorithm that solves this problem?

like image 236
Rahul Kumar Avatar asked Oct 11 '14 11:10

Rahul Kumar


2 Answers

This problem is NP-hard by reduction from CLIQUE (it's sort of a densest k-sub(hyper)graph problem). Given a graph, label its vertices with distinct letters and, for each edge, make a two-letter word. There exists a k-clique if and only if we can cover k choose 2 words with k letters.

The algorithm situation even for CLIQUE is grim (running times must be n^Theta(k) under a plausible hypothesis), so I'm not sure what to recommend other than brute force with primitive bit arrays.

like image 126
David Eisenstat Avatar answered Nov 20 '22 10:11

David Eisenstat


I'm not yet sure that this is correct, but hopefully it's at least close. We consider a dynamic programming solution. Enumerate the words 1 through N, the letters in our alphabet 1 through P. We want to be able to solve (n, p) in terms of all sub solutions. We consider several cases.

The simplest case is where the nth word is already in the dictionary given in the solution to (n-1, p). We then count ourselves lucky, up the words covered by one, and leave the dictionary unchanged (ddictionary refers to some subset of letters here).

Suppose instead that the nth word is not in the dictionary given by (n-1, p). Then either the dictionary solving (n-1, p) is the dictionary for (n, p), OR the nth word is in the solution. So we look for solutions that explicitly involve the nth word. So, we add all the letters in the nth word to the dictionary we are considering. We now search through all previous subsolutions of the form (n-1, i), where i is p-1 or less. We are looking for the largest value of i such that |d(n-1, i) U d(n)| <= p. Where d(n-1, i) means the dictionary associated with that solution, and d(n) simply means the dictionary associated with all letters of the nth word. In plain English, we use our subsolutions to find the best solution with a smaller value of p that allows us to fit the new word. Once we have found that value of i, we combine the dictionaries whose magnitude we were measuring. If the magnitude of this set is still not p, we repeat the process described before. When we have created a dictionary with magnitude p that covers the nth word with this technique (or iterated through all previous solutions), we compute its coverage and compare it to the coverage we would get by simply using the dictionary from (n-1, p), and we pick the better. If theres a tie, we pick both.

Im not completely convinced of the correctness of this solution, but I think it may be right. Thoughts?

like image 2
Nir Friedman Avatar answered Nov 20 '22 10:11

Nir Friedman