Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compute the maximum number of runs possible for a given length string

A few weeks ago Lembik asked the following question:

A period p of a string w is any positive integer p such that w[i]=w[i+p] whenever both sides of this equation are defined. Let per(w) denote the size of the smallest period of w . We say that a string w is periodic iff per(w) <= |w|/2.

So informally a periodic string is just a string that is made up from another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.

For, example consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.

As more examples, abcabca, ababababa and abcabcabc are also periodic.

For those who like regexes, this one detects if a string is periodic or not:

\b(\w*)(\w+\1)\2+\b

The task is to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.

A substring w[i,j] of w is a maximal periodic substring (run) if it is periodic and neither w[i-1] = w[i-1+p] nor w[j+1] = w[j+1-p]. Informally, the "run" cannot be contained in a larger "run" with the same period.

Because two runs can represent the same string of characters occurring in different places in the overall string, we will represent runs by intervals. Here is the above definition repeated in terms of intervals.

A run (or maximal periodic substring) in a string T is an interval [i...j] with j>=i, such that

  • T[i...j] is a periodic word with the period p = per(T[i...j])
  • It is maximal. Formally, neither T[i-1] = T[i-1+p] nor T[j+1] = T[j+1-p]. Informally, the run cannot be contained in a larger run with the same period.

Denote by RUNS(T) the set of runs in string T.

Examples of runs

  • The four maximal periodic substrings (runs) in string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.

  • The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs): T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.

  • The string T = atatbatatb contains the following three runs. They are: T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.

Here I am using 1-indexing.

The goal

Write code so that for each integer n starting at 2, you output the largest numbers of runs contained in any binary string of length n.

Example optima

In the following: n, optimum number of runs, example string.

2 1 00
3 1 000
4 2 0011
5 2 00011
6 3 001001
7 4 0010011
8 5 00110011
9 5 000110011
10 6 0010011001
11 7 00100110011
12 8 001001100100
13 8 0001001100100
14 10 00100110010011
15 10 000100110010011
16 11 0010011001001100
17 12 00100101101001011
18 13 001001100100110011
19 14 0010011001001100100

Is there a faster way to find the optimum number of runs for increasing values of n than a naive O(n^2 2^n) time approach?

like image 287
graffe Avatar asked Jul 17 '16 20:07

graffe


1 Answers

A generational algorithm to find all solutions

The Idea

In every string the last character can only contribute to a limited number of runs.

A last 0 can only add a run to

10 + 0 => 100

since in

00 + 0 => 000

it's only a repetition. Anf if it adds that minimal run, the next possible minimal run to add is

110010 + 0 => 1100100

note again

010010 + 0 => 0100100

is not an additional run, it's a repetition. The next possible adds are

111001001100100
1111001001100100111001001100100
...

The digits can vary but the minimal lengths are

3, 7, 15, 31

which is

4^1 - 1, 4^2 - 1, ..., 4^n - 1

At string start there is no need for a different character, thus

maxaddlast = 4^n - 2

yields the maximum number of runs which can be added with by adding the last character.

The Algorithm

  • While the search for length n is done, all variants are recorded with a run count in [maxNumberOfRuns - maxaddlast(n+1), maxNumberOfRuns].
  • To find a solution with maxNumberOfRuns for n+1 simply all recorded variants are extended with 0 and 1 and checked.

The Seed

The remaining problem is to size the stack to collect all variants which are required for future seeds.

Since there is not enough data to guess a valid formula an adaptive algorithm is choosen:

  1. The initial stack size for n is guessed from n - 1
  2. For every solution the used stack sizes are checked that there is always is room by 1 in the stack.
  3. If the stack was fully used at some n, the stack size is increased and the computation is restarted at n.

The Result

length 104 with 91 runs

is reached within 600 seconds. Then the memory is used up with default settings. Use -Xmx16G or more. For larger numbers the code must be modified to maintain the seed on disk, not in memory.

And it's lot faster than the brute force method.

** The Code **

And here is my example code in Java:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.ArrayList;

import de.bb.util.Pair;

/**
 * A search algorithm to find all runs for increasing lengths of strings of 0s
 * and 1s.
 * 
 * This algorithm uses a seed to generate the candidates for the next search.
 * The seed contains the solutions for rho(n), rho(n) - 1, ..., minstart(n).
 * Since the seed size is unknown, it starts with a minimal seed: minstart(n) =
 * rho(n) - 1; After the solutions are calculated the all seeds are checked. If
 * a seed with minstart(n) was used, that minstart(n) gets decremented and the
 * search is restarted at position n + 1. This guarantees that the seed is
 * always large enough.
 * 
 * Optional TODO: Since the seed can occupy large amounts of memory, the seed is
 * maintained on disk.
 * 
 * @author Stefan "Bebbo" Franke (c) 2016
 */
public class MaxNumberOfRunsAdaptive {
    private static long start;

    private ArrayList<Pair<byte[], ArrayList<Integer>>> seed = new ArrayList<>();
    private int max;
    private ArrayList<ArrayList<Pair<byte[], ArrayList<Integer>>>> nextSeedStack;

    private ArrayList<Integer> maxs = new ArrayList<>();
    private ArrayList<Integer> diffs = new ArrayList<>();
    private ArrayList<Integer> totals = new ArrayList<>();
    private int total;

    private byte[] buffer;

    public static void main(String[] args) {
        int limit = 9999;
        if (args.length == 1) {
            try {
                limit = Integer.parseInt(args[0]);
            } catch (Exception e) {
            }
        }
        start = System.currentTimeMillis();
        new MaxNumberOfRunsAdaptive().run(limit);
        long took = (System.currentTimeMillis() - start) / 100;
        System.out.println("took " + (took / 10.) + "s");
    }

    /**
     * Find a string with the max number of runs for all lengths from 2 to
     * limit;
     * 
     * @param limit
     *            the limit to stop calculation.
     */
    private void run(int limit) {
        maxs.add(0);
        maxs.add(0);
        diffs.add(0);
        diffs.add(1);
        totals.add(0);
        totals.add(0);

        ArrayList<Integer> n0 = new ArrayList<Integer>();
        n0.add(0);
        seed.add(Pair.makePair(new byte[] { '0' }, n0));
        saveSeed(2);

        for (int i = 2; i <= limit;) {
            int restart = compose(i);
            if (restart < i) {
                System.out.println("*** restarting at: " + restart + " ***");
                i = restart;
                loadSeed(i);
                total = totals.get(i - 1);
            } else {
                saveSeed(i + 1);
                ++i;
            }
        }
    }

    /**
     * Load the seed for the length from disk.
     * 
     * @param length
     */
    private void loadSeed(int length) {
        try {
            seed.clear();
            final FileReader fr = new FileReader("seed-" + length + ".txt");
            final BufferedReader br = new BufferedReader(fr);
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                final int space = line.indexOf(' ');
                final byte[] b = line.substring(0, space).getBytes();
                final String sends = line.substring(space + 2, line.length() - 1);
                final ArrayList<Integer> ends = new ArrayList<>();
                for (final String s : sends.split(",")) {
                    ends.add(Integer.parseInt(s.trim()));
                }
                seed.add(Pair.makePair(b, ends));
            }
            fr.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Save the seed for the given length to the disk.
     * 
     * @param length
     *            the length
     */
    private void saveSeed(int length) {
        try {
            final FileWriter fos = new FileWriter("seed-" + length + ".txt");
            for (final Pair<byte[], ArrayList<Integer>> p : seed) {
                fos.write(new String(p.getFirst()) + " " + p.getSecond().toString() + "\n");
            }
            fos.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Compose new strings from all available bases. Also collect the candidates
     * for the next base.
     */
    private int compose(int length) {
        max = 0;

        int nextStackSize;
        if (diffs.size() > length)
            nextStackSize = diffs.get(length) + 1;
        else
            nextStackSize = diffs.get(length - 1) - 1;
        if (nextStackSize < 2)
            nextStackSize = 2;

        // setup collector for next bases
        nextSeedStack = new ArrayList<>();
        for (int i = 0; i < nextStackSize; ++i) {
            nextSeedStack.add(new ArrayList<Pair<byte[], ArrayList<Integer>>>());
        }

        buffer = new byte[length];
        // extend the bases
        for (Pair<byte[], ArrayList<Integer>> e : seed) {
            final byte[] s = e.getFirst();
            System.arraycopy(s, 0, buffer, 0, length - 1);
            if (s.length < 3 || s[s.length - 1] == '1' || s[s.length - 2] == '1' || s[s.length - 3] == '1') {
                buffer[length - 1] = '0';
                test(length, e.getSecond());
            }
            if (s.length < 3 || s[s.length - 1] == '0' || s[s.length - 2] == '0' || s[s.length - 3] == '0') {
                buffer[length - 1] = '1';
                test(length, e.getSecond());
            }
        }
        long took = (System.currentTimeMillis() - start) / 100;

        final ArrayList<String> solutions = new ArrayList<String>();
        for (Pair<byte[], ArrayList<Integer>> p : nextSeedStack.get(nextSeedStack.size() - 1)) {
            solutions.add(new String(p.getFirst()));
        }
        total += solutions.size();
        if (totals.size() <= length)
            totals.add(0);
        totals.set(length, total);

        if (maxs.size() <= length) {
            maxs.add(0);
        }
        maxs.set(length, max);

        System.out.println(length + " " + max + " " + (took / 10.) + " " + total + " " + solutions);

        seed.clear();
        // setup base for next level
        for (ArrayList<Pair<byte[], ArrayList<Integer>>> t : nextSeedStack) {
            seed.addAll(t);
        }

        if (diffs.size() <= length) {
            diffs.add(1);
        }
        int restart = length;
        // check for restart
        for (final String b : solutions) {
            for (int i = 2; i < b.length(); ++i) {
                int diff = maxs.get(i) - countRuns(b.substring(0, i));
                if (diff >= diffs.get(i)) {
                    if (i < restart)
                        restart = i;
                    diffs.set(i, diff + 1);
                }
            }
        }
        System.out.println(diffs);

        return restart;
    }

    /**
     * Test the current buffer and at it to the next seed stack,
     * 
     * @param l
     *            the current length
     * @param endRuns
     *            the end runs to store
     */
    void test(final int l, final ArrayList<Integer> endRuns) {
        final ArrayList<Integer> r = incrementalCountRuns(l, endRuns);
        final int n = r.get(r.size() - 1);

        // shift the nextBaseStack
        while (max < n) {
            nextSeedStack.remove(0);
            nextSeedStack.add(new ArrayList<Pair<byte[], ArrayList<Integer>>>());
            ++max;
        }

        // add to set in stack, if in stack
        final int index = nextSeedStack.size() - 1 - max + n;
        if (index >= 0)
            nextSeedStack.get(index).add(Pair.makePair(buffer.clone(), r));
    }

    /**
     * Find incremental the runs incremental.
     * 
     * @param l
     *            the lengths
     * @param endRuns
     *            the runs of length-1 ending at length -1
     * @return a new array containing the end runs plus the length
     */
    private ArrayList<Integer> incrementalCountRuns(final int l, final ArrayList<Integer> endRuns) {
        final ArrayList<Integer> res = new ArrayList<Integer>();
        int sz = endRuns.size();
        // last end run dummy - contains the run count
        int n = endRuns.get(--sz);
        int pos = 0;

        for (int i = l - 2; i >= 0; i -= 2) {
            int p = (l - i) / 2;
            // found something ?
            if (equals(buffer, i, buffer, i + p, p)) {
                while (i > 0 && buffer[i - 1] == buffer[i - 1 + p]) {
                    --i;
                }
                int lasti = -1;

                while (pos < sz) {
                    lasti = endRuns.get(pos);
                    if (lasti <= i)
                        break;
                    lasti = -1;
                    ++pos;
                }
                if (lasti != i)
                    ++n;

                res.add(i);
            }
        }

        res.add(n);
        return res;

    }

    /**
     * Compares one segment of a byte array with a segment of a 2nd byte array.
     * 
     * @param a
     *            first byte array
     * @param aOff
     *            offset into first byte array
     * @param b
     *            second byte array
     * @param bOff
     *            offset into second byte array
     * @param len
     *            length of the compared segments
     * @return true if the segments are equal, otherwise false
     */
    public final static boolean equals(byte a[], int aOff, byte b[], int bOff, int len) {
        if (a == null || b == null)
            return a == b;
        while (len-- > 0)
            if (a[aOff + len] != b[bOff + len])
                return false;
        return true;
    }

    /**
     * Simple slow stupid method to count the runs in a String.
     * 
     * @param s
     *            the string
     * @return the count of runs.
     */
    static int countRuns(String s) {
        int n = 0;
        int l = s.length();
        for (int i = 0; i < l - 1; ++i) {
            for (int k = i + 1; k < l; ++k) {
                int p = 0;
                while (i + p < k && k + p < l) {
                    if (s.charAt(i + p) != s.charAt(k + p))
                        break;
                    ++p;
                }
                if (i + p == k) {
                    int jj = k + p - 1;
                    if (i > 0 && s.charAt(i - 1) == s.charAt(i - 1 + p)) {
                        continue;
                    }
                    while (jj + 1 < l && s.charAt(jj + 1) == s.charAt(jj + 1 - p)) {
                        ++jj;
                        ++k;
                    }
                    ++n;
                }
            }
        }
        return n;
    }
}
like image 181
bebbo Avatar answered Oct 14 '22 22:10

bebbo