Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example I have a string a b a b and a list of rules

  1. a b -> b
  2. a b -> c
  3. b a -> a
  4. c c -> b

The purpose is to get all single chars that can be received from the given string using these rules. For this example it will be b, c. The length of the given string can be up to 200 symbols. Could you please prompt an effective algorithm?

Rules always are 2 -> 1. I've got an idea of creating a tree, root is given string and each child is a string after one transform, but I'm not sure if it's the best way.

like image 403
user2875945 Avatar asked May 08 '15 20:05

user2875945


3 Answers

If you read those rules from right to left, they look exactly like the rules of a context free grammar, and have basically the same meaning. You could apply a bottom-up parsing algorithm like the Earley algorithm to your data, along with a suitable starting rule; something like

start <- start a
       | start b
       | start c

and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3) of course, but Earley is fairly effective, these days.

You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.

like image 148
Jay Kominek Avatar answered Nov 19 '22 06:11

Jay Kominek


For a DP problem, you always need to understand how you can construct the answer for a big problem in terms of smaller sub-problems. Assume you have your function simplify which is called with an input of length n. There are n-1 ways to split the input in a first and a last part. For each of these splits, you should recursively call your simplify function on both the first part and the last part. The final answer for the input of length n is the set of all possible combinations of answers for the first and for the last part, which are allowed by the rules.

In Python, this can be implemented like so:

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@ memoize
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)

    possible_chars = set()

    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]

        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1+c2, set()))

                # speed hack
                if possible_chars == all_chars: #  won't get any bigger
                    return all_chars

    return possible_chars

Quick check:

In [53]: simplify('abab')
Out[53]: {'b', 'c'}

To make this fast enough for large strings (to avoiding exponential behavior), you should use a memoize decorator. This is a critical step in solving DP problems, otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == set('abc'), since at that point, you are already sure that you can generate all possible outcomes.

Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).

like image 30
Bas Swinckels Avatar answered Nov 19 '22 07:11

Bas Swinckels


Let N - length of given string and R - number of rules.

Expanding a tree in a top down manner yields computational complexity O(NR^N) in the worst case (input string of type aaa... and rules aa -> a).

Proof:

Root of the tree has (N-1)R children, which have (N-1)R^2 children, ..., which have (N-1)R^N children (leafs). So, the total complexity is O((N-1)R + (N-1)R^2 + ... (N-1)R^N) = O(N(1 + R^2 + ... + R^N)) = (using binomial theorem) = O(N(R+1)^N) = O(NR^N).

Recursive Java implementation of this naive approach:

public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);
    }
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}

Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):

public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        for (String c1 : simplify2(head, rules)) {
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}

Output in both approaches:

[b, c]
like image 1
Adam Stelmaszczyk Avatar answered Nov 19 '22 06:11

Adam Stelmaszczyk