
How to split text into the minimum number of chunks?

OVERVIEW

I have a set of valid chunks that I can use to split a text (if a split is possible).

How can I split a given text using these chunks so that the number of resulting chunks is minimized?

TEST SUITE

if __name__ == "__main__":
    import random
    import sys

    random.seed(1)

    # 1) Testing robustness
    examples = []
    sys.stdout.write("Testing robustness...")
    N = 50
    large_number = "3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481"
    for i in range(100):
        for j in range(i):
            # Chunks are compared against a string, so convert the sampled integers
            choices = [str(c) for c in random.sample(range(i), j)]
            examples.append((choices, large_number))

    for (choices, large_number) in examples:
        get_it_done(choices, large_number)
    sys.stdout.write("OK\n")

    # 2) Testing correctness
    examples = [
        # Example1
        # Solution ['012345678910203040506070', '80', '90', '100', '200', '300', '400', '500', '600', '700', '800', '900']
        (
            [
                "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "100", "200", "300", "400", "500", "600", "700", "800", "900",
                "012345678910203040506070"
            ],
            "0123456789102030405060708090100200300400500600700800900"
        ),
        # Example2
        # Solution ['100']
        (
            ["0", "1", "10", "100"],
            "100"
        ),
        # Example3
        # Solution ['101234567891020304050', '6070809010020030040050', '0600700800900']
        (
            [
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "012345678910203040506070",
                "101234567891020304050",
                "6070809010020030040050",
                "0600700800900"
            ],
            "10123456789102030405060708090100200300400500600700800900"
        ),
        # Example4
        # Solution ['12', '34', '56', '78', '90']
        (
            [
                "12", "34", "56", "78", "90",
                "890",
            ],
            "1234567890"
        ),
        # Example5
        # Solution ['12', '34']
        (
            [
                "1", "2", "3",
                "12", "23", "34"
            ],
            "1234"
        ),
        # Example6
        # Solution ['100', '10']
        (
            ["0", "1", "10", "100"],
            "10010"
        )
    ]

    score = 0
    for (choices, large_number) in examples:
        res = get_it_done(choices, large_number)
        flag = "".join(res) == large_number
        print("{0}\n{1}\n{2} --> {3}".format(
            large_number, "".join(res), res, flag))
        print('-' * 80)
        score += flag

    print(
        "Score: {0}/{1} = {2:.2f}%".format(score, len(examples), score / len(examples) * 100))

    # 3) TODO: Testing optimization, it should provide (if possible)
    #          minimal cases

QUESTION

How can I solve this problem in Python without using a brute-force approach?

BPL asked Sep 28 '16 14:09


1 Answer

Using dynamic programming, you can construct a list (l_0, l_1, l_2, ..., l_{n-1}), where n is the number of characters in your input string and l_i is the minimum number of chunks needed to cover the input string up to and including character i. The overall structure would look as follows:

minValues := list with n infinity entries
for i from 0 to n-1
    for every choice c that is a suffix of input[0..i]
        if i - len(c) < 0
            newVal := 1
        else
            newVal := minValues[i - len(c)] + 1
        end if
        if newVal < minValues[i]
            minValues[i] := newVal
            // optionally record the used chunk c
        end if
    next
next

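The minimum number of chunks for your entire string is then l_{n-1}. If that entry is still infinity, the string cannot be composed from the given choices. You can get the actual chunks by tracing back through the list, which requires recording the chunk used at each position.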
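The table fill and the trace-back described above can be sketched in Python as follows. This is a minimal sketch rather than a definitive implementation: the name get_it_done is taken from the test suite in the question, while returning an empty list when no split exists and converting the choices to strings are assumptions made here.

```python
import math


def get_it_done(choices, text):
    """Split text into the fewest chunks taken from choices.

    Returns the chunks in order, or [] if no split exists (an assumption;
    the question does not say what to return in that case).
    """
    n = len(text)
    if n == 0:
        return []
    # Assumption: coerce choices to strings, drop duplicates and empty ones.
    chunks = [s for s in dict.fromkeys(str(c) for c in choices) if s]

    min_values = [math.inf] * n   # min_values[i]: fewest chunks covering text[0..i]
    used = [None] * n             # used[i]: chunk ending at i in one best split

    for i in range(n):
        for c in chunks:
            # Only choices that are a suffix of text[0..i] qualify.
            if not text.endswith(c, 0, i + 1):
                continue
            start = i - len(c)    # index of the last character before the chunk
            new_val = 1 if start < 0 else min_values[start] + 1
            if new_val < min_values[i]:
                min_values[i] = new_val
                used[i] = c

    if min_values[-1] == math.inf:
        return []                 # the text cannot be composed from the choices

    # Trace back from the last character, collecting chunks in reverse order.
    result = []
    i = n - 1
    while i >= 0:
        result.append(used[i])
        i -= len(used[i])
    result.reverse()
    return result
```

This version checks every choice at every position, so it does not use the trie speed-up; it costs roughly O(n * c * lc) character comparisons.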

Retrieving the choices that are suffixes can be sped up with a trie built from the reversed choice strings. The worst-case complexity is still O(n * c * lc), where n is the length of the input string, c is the number of choices, and lc is the maximum length of a choice. However, this worst case only occurs when the choices are nested suffixes of one another (e.g. 0, 10, 010, 0010, ...), in which case the trie degenerates to a list. On average, the run time should be much lower. Assuming the number of choices retrieved from the trie is always a small constant, the run time is O(n * lc) (and the lc factor is probably smaller in practice as well).
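One way the suffix lookup with a trie of reversed choices might look is sketched below. The trie is represented as nested dictionaries, and all names (build_reversed_trie, suffix_choices, min_chunks_with_trie, _END) are hypothetical helpers introduced for illustration. The sketch only computes the minimum chunk count, not the chunks themselves.

```python
import math

_END = ""  # marker key; cannot clash with the single-character edge keys


def build_reversed_trie(choices):
    """Build a trie (nested dicts) from the reversed choice strings."""
    root = {}
    for choice in choices:
        if not choice:
            continue              # an empty chunk can never help
        node = root
        for ch in reversed(choice):
            node = node.setdefault(ch, {})
        node[_END] = choice       # a complete choice ends at this node
    return root


def suffix_choices(trie, text, i):
    """Yield every choice that is a suffix of text[0..i], shortest first."""
    node = trie
    for k in range(i, -1, -1):    # walk backwards from character i
        node = node.get(text[k])
        if node is None:
            return                # no longer choice can match
        if _END in node:
            yield node[_END]


def min_chunks_with_trie(choices, text):
    """Return the minimum number of chunks, or math.inf if no split exists."""
    trie = build_reversed_trie(choices)
    n = len(text)
    if n == 0:
        return 0
    min_values = [math.inf] * n
    for i in range(n):
        for c in suffix_choices(trie, text, i):
            start = i - len(c)
            new_val = 1 if start < 0 else min_values[start] + 1
            min_values[i] = min(min_values[i], new_val)
    return min_values[-1]
```

For example, with the choices "0", "1", "10", "100" and the text "10010", the backward walk from index 4 yields "0" and then "10" and stops as soon as no trie edge matches.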

Here is an example:

choices = ["0","1","10","100"]
text = "10010"

algorithm step    content of minValues
                   0      1       2        3      4
---------------------------------------------------------
initialize        (∞,     ∞ ,     ∞ ,      ∞ ,    ∞     )
i = 0, c = "1"    (1 "1", ∞ ,     ∞ ,      ∞ ,    ∞     )
i = 1, c = "0"    (1 "1", 2 "0",  ∞ ,      ∞ ,    ∞     )
i = 1, c = "10"   (1 "1", 1 "10", ∞ ,      ∞ ,    ∞     )
i = 2, c = "0"    (1 "1", 1 "10", 2 "0",   ∞ ,    ∞     )
i = 2, c = "100"  (1 "1", 1 "10", 1 "100", ∞ ,    ∞     )
i = 3, c = "1"    (1 "1", 1 "10", 1 "100", 2 "1", ∞     )
i = 4, c = "0"    (1 "1", 1 "10", 1 "100", 2 "1", 3 "0" )
i = 4, c = "10"   (1 "1", 1 "10", 1 "100", 2 "1", 2 "10")

Meaning: We can compose the string with 2 chunks. Tracing back gives the chunks in reverse order: "10", "100".
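The result of 2 chunks for this example can be cross-checked by exhaustive enumeration, which is also one possible way to approach the "testing optimization" TODO in the question's test suite. The following is only a sketch: brute_force_split is a hypothetical helper name, and because it tries every way of cutting the text it is only feasible for short inputs.

```python
from itertools import combinations


def brute_force_split(choices, text):
    """Return a split with the fewest chunks by trying every segmentation.

    Returns None if the text cannot be composed from the choices.
    Exponential in len(text), so only usable as a reference for short texts.
    """
    allowed = set(choices)
    n = len(text)
    if n == 0:
        return []
    # Try 1 chunk, then 2, and so on, so the first hit is minimal.
    for k in range(1, n + 1):
        # Choose k - 1 cut positions strictly inside the text.
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            parts = [text[a:b] for a, b in zip(bounds, bounds[1:])]
            if all(p in allowed for p in parts):
                return parts
    return None


print(brute_force_split(["0", "1", "10", "100"], "10010"))
```

Comparing the length of this result with the length of the dynamic programming result on many short random inputs gives a simple minimality test.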

Nico Schertler answered Oct 31 '22 03:10