 

Finding substrings of string such that product of the length of the substring with its number of occurrences is maximized

I was thinking about the following problem: given a string S, let l_i be the length of its i-th distinct substring and o_i the number of times that substring occurs in S. Print the substring for which l_i * o_i is maximized.

I have an O(n^3) brute-force solution that generates every substring and keeps the one with the maximum value. My code is as follows:

public static void solve(String S) {
    long max = 0; // every non-empty substring scores at least 1
    String res = "";
    // i is the start index, j is the length of the candidate substring
    for (int i = 0; i < S.length(); i++) {
        for (int j = 1; j <= S.length() - i; j++) {
            String s = S.substring(i, i + j);
            int o = countOccurrences(S, s); // occurrences of s in S
            long p = (long) o * (long) s.length();
            if (max < p) {
                max = p;
                res = s;
            }
        }
    }
    System.out.println(res);
}

where the countOccurrences() method takes O(n) time. Is there a more efficient way to achieve this?
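The question does not show countOccurrences(), so the following is only a sketch of what such a helper might look like. It assumes overlapping occurrences are counted (so "aa" occurs 3 times in "aaaa") and uses the Knuth-Morris-Pratt failure function to stay within O(n + m) time. The class name OccurrenceCounter is hypothetical.

```java
// Hypothetical helper; the asker's actual implementation is not shown.
class OccurrenceCounter {

    // Counts (possibly overlapping) occurrences of pattern in text using KMP.
    static int countOccurrences(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return 0;
        }
        // fail[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of it.
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        int count = 0;
        for (int i = 0, k = 0; i < n; i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == m) {
                count++;
                k = fail[k - 1]; // fall back so overlapping matches are found
            }
        }
        return count;
    }
}
```

With a helper like this, the brute force above performs O(n^2) calls of O(n) each, which is where the O(n^3) bound comes from.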

asked Oct 21 '22 05:10 by n00bc0d3r


1 Answer

Here's a linear-time algorithm:

  1. Build a suffix tree on the input string. This takes O(n) time and space.
  2. Traverse the suffix tree in a postorder DFS, computing for each node the number of leaves beneath it by summing the values of its children (a leaf counts as 1). This is the number of occurrences of the node's string. As soon as this quantity is known for a node, multiply it by the node's string length (the sum of the lengths of all edges on the path from the root) and update the best-so-far score if necessary. This also takes O(n) time.

The key points are that

  • A suffix tree contains only a linear number of nodes (at most n leaves and fewer than n internal nodes), and
  • Any substring that does not end exactly at a node cannot produce a maximal score. This is because as you trace it from the suffix tree root it must stop "partway down" some edge, but you can always extend it further without reducing the number of occurrences (which is the number of leaves below), and thus increase the score, by continuing on down to the next node. Among the leaves, only the one for the whole string matters: it occurs once and scores n.
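As a rough illustration of the same idea, here is a sketch that swaps in a suffix automaton for the suffix tree, since it is shorter to write in full. This is a different data structure from the one the answer describes, but it plays the same role: each state stands for a group of substrings with identical occurrence sets, and only the longest substring of each state needs to be scored. The class and method names (SuffixAutomatonScore, solve, bestSubstring) are hypothetical, and the HashMap transitions are chosen for brevity rather than speed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; uses a suffix automaton, not the suffix tree from the answer.
class SuffixAutomatonScore {

    // Returns {best score, start index, length} of a substring maximizing length * occurrences.
    static long[] solve(String s) {
        int n = s.length();
        int cap = 2 * n + 1;             // each character adds at most two states
        int[] len = new int[cap];        // length of the longest string in each state
        int[] link = new int[cap];       // suffix links
        int[] firstEnd = new int[cap];   // end index of one occurrence
        long[] cnt = new long[cap];      // number of occurrences (after propagation)
        List<Map<Character, Integer>> next = new ArrayList<>();
        next.add(new HashMap<>());
        link[0] = -1;
        int size = 1, last = 0;
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            int cur = size++;
            next.add(new HashMap<>());
            len[cur] = len[last] + 1;
            firstEnd[cur] = i;
            cnt[cur] = 1;                // a non-clone state marks one real end position
            int p = last;
            while (p != -1 && !next.get(p).containsKey(c)) {
                next.get(p).put(c, cur);
                p = link[p];
            }
            if (p == -1) {
                link[cur] = 0;
            } else {
                int q = next.get(p).get(c);
                if (len[p] + 1 == len[q]) {
                    link[cur] = q;
                } else {
                    int clone = size++;  // clones keep cnt = 0
                    next.add(new HashMap<>(next.get(q)));
                    len[clone] = len[p] + 1;
                    link[clone] = link[q];
                    firstEnd[clone] = firstEnd[q];
                    while (p != -1 && next.get(p).getOrDefault(c, -1) == q) {
                        next.get(p).put(c, clone);
                        p = link[p];
                    }
                    link[q] = clone;
                    link[cur] = clone;
                }
            }
            last = cur;
        }
        // Order states by len (counting sort), then push counts up the suffix links.
        int[] bucket = new int[n + 1];
        for (int v = 0; v < size; v++) bucket[len[v]]++;
        for (int l = 1; l <= n; l++) bucket[l] += bucket[l - 1];
        int[] order = new int[size];
        for (int v = size - 1; v >= 0; v--) order[--bucket[len[v]]] = v;
        long best = 0;
        int bestStart = 0, bestLen = 0;
        for (int i = size - 1; i > 0; i--) {   // order[0] is the root, which is skipped
            int v = order[i];
            cnt[link[v]] += cnt[v];
            long score = cnt[v] * len[v];
            if (score > best) {
                best = score;
                bestLen = len[v];
                bestStart = firstEnd[v] - len[v] + 1;
            }
        }
        return new long[] {best, bestStart, bestLen};
    }

    static String bestSubstring(String s) {
        long[] r = solve(s);
        int start = (int) r[1];
        return s.substring(start, start + (int) r[2]);
    }

    public static void main(String[] args) {
        System.out.println(bestSubstring("abcabcabc"));
    }
}
```

When several substrings tie for the best score, this sketch returns whichever one it encounters first, so callers that need a specific tie-break rule would have to add it.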

It might also be possible to do this using suffix arrays instead of suffix trees, in which case it's likely to require a constant factor less memory, but add a logarithmic factor to the running time.
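One way the suffix-array variant could look is sketched below, under some assumptions. The suffixes are sorted with a plain comparison sort purely for brevity (far from optimal; prefix doubling or a linear-time construction would replace it in practice), the LCP array comes from Kasai's algorithm, and the best score is found with the usual "largest rectangle in a histogram" stack scan over the LCP values: a run of k adjacent LCP entries with minimum m means some substring of length m occurs k + 1 times. The class name SuffixArrayScore and method maxScore are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical sketch of the suffix-array approach; returns only the best score.
class SuffixArrayScore {

    static long maxScore(String s) {
        int n = s.length();
        if (n == 0) {
            return 0;
        }
        // Naive suffix array: sort suffix start positions by comparing the suffixes directly.
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));

        // Kasai's algorithm: lcp[r] = longest common prefix of suffixes sa[r - 1] and sa[r].
        int[] rank = new int[n];
        for (int r = 0; r < n; r++) rank[sa[r]] = r;
        int[] lcp = new int[n];
        int h = 0;
        for (int i = 0; i < n; i++) {
            if (rank[i] > 0) {
                int j = sa[rank[i] - 1];
                while (i + h < n && j + h < n && s.charAt(i + h) == s.charAt(j + h)) h++;
                lcp[rank[i]] = h;
                if (h > 0) h--;
            } else {
                h = 0;
            }
        }

        // A substring that occurs once scores at most n (the whole string).
        long best = n;
        // Histogram scan over lcp[1..n-1]; the stack holds indices with increasing heights.
        Deque<Integer> stack = new ArrayDeque<>();
        for (int i = 1; i <= n; i++) {
            int cur = (i < n) ? lcp[i] : 0;   // sentinel 0 flushes the stack at the end
            while (!stack.isEmpty() && lcp[stack.peek()] >= cur) {
                int height = lcp[stack.pop()];
                int left = stack.isEmpty() ? 1 : stack.peek() + 1;
                int width = i - left;          // number of adjacent LCP entries covered
                best = Math.max(best, (long) height * (width + 1));
            }
            if (i < n) {
                stack.push(i);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(maxScore("abcabcabc"));
    }
}
```

To recover the substring itself rather than just the score, one could record sa[left - 1] and height whenever best improves.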

answered Oct 23 '22 01:10 by j_random_hacker