How to split text into chunks minimizing the solution?

Question

OVERVIEW

I have a set of valid chunks that I can use to split a text (if a split is possible at all).

How can I split a given text using these chunks so that the result is optimal, i.e. uses the minimum number of chunks?

TEST SUITE

if __name__ == "__main__":
    import random
    import sys

    random.seed(1)

    # 1) Testing robustness
    examples = []
    sys.stdout.write("Testing robustness...")
    N = 50
    large_number = "3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481"
    for i in range(100):
        for j in range(i):
            choices = random.sample(range(i), j)
            examples.append((choices, large_number))

    for (choices, large_number) in examples:
        get_it_done(choices, large_number)
    sys.stdout.write("OK")

    # 2) Testing correctness
    examples = [
        # Example1 ->
        # Solution ['012345678910203040506070', '80', '90', '100', '200', '300', '400', '500', '600', '700', '800', '900']
        (
            [
                "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "100", "200", "300", "400", "500", "600", "700", "800", "900",
                "012345678910203040506070"
            ],
            "0123456789102030405060708090100200300400500600700800900"
        ),
        # Example2
        ## Solution ['100']
        (
            ["0", "1", "10", "100"],
            "100"
        ),
        # Example3
        ## Solution ['101234567891020304050', '6070809010020030040050', '0600700800900']
        (
            [
                "10", "20", "30", "40", "50", "60", "70", "80", "90",
                "012345678910203040506070",
                "101234567891020304050",
                "6070809010020030040050",
                "0600700800900"
            ],
            "10123456789102030405060708090100200300400500600700800900"
        ),
        # Example4
        ### Solution ['12', '34', '56', '78', '90']
        (
            [
                "12", "34", "56", "78", "90",
                "890",
            ],
            "1234567890"
        ),
        # Example5
        ## Solution ['12', '34']
        (
            [
                "1", "2", "3",
                "12", "23", "34"
            ],
            "1234"
        ),
        # Example6
        ## Solution ['100', '10']
        (
            ["0", "1", "10", "100"],
            "10010"
        )
    ]

    score = 0
    for (choices, large_number) in examples:
        res = get_it_done(choices, large_number)
        flag = "".join(res) == large_number
        print("{0}\n{1}\n{2} --> {3}".format(
            large_number, "".join(res), res, flag))
        print('-' * 80)
        score += flag

    print(
        "Score: {0}/{1} = {2:.2f}%".format(score, len(examples), score / len(examples) * 100))

    # 3) TODO: Testing optimization, it should provide (if possible)
    #          minimal cases

QUESTION

How could I solve this problem in Python without using a brute-force approach?

Nico Schertler · Accepted Answer

Using dynamic programming, you can construct a list (l_0, l_1, l_2, ..., l_{n-1}), where n is the number of characters in your input string and l_i is the minimum number of chunks you need to cover the input string up to and including character i. The overall structure would look as follows:

minValues := list with n infinity entries
for i from 0 to n-1
    for every choice c that is a suffix of input[0..i]
        if i - len(c) < 0
            newVal = 1
        else
            newVal = minValues[i - len(c)] + 1
        end if
        if(newVal < minValues[i])
            minValues[i] = newVal
            //optionally record the used chunk
        end if
    next
next

The minimum number of chunks for your entire string is then l_{n-1}. If that entry is still infinity, the string cannot be composed from the choices. You can get the actual chunks by tracing back through the list, which requires recording the chunk used at each position.
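A minimal Python sketch of the pseudocode above, including the trace-back step. The name get_it_done is taken from the question's test suite. Returning an empty list when no split exists and converting the choices with str() (the robustness test passes integers) are assumptions made here, not part of the original answer.

```python
import math


def get_it_done(choices, text):
    """Split `text` into the fewest chunks from `choices`; [] if impossible."""
    # Assumption: choices may hold non-strings (the robustness test passes
    # integers), so normalise them, drop duplicates and empty strings.
    chunks = [c for c in dict.fromkeys(str(c) for c in choices) if c]
    n = len(text)
    min_values = [math.inf] * n   # min_values[i]: fewest chunks covering text[:i+1]
    used = [None] * n             # used[i]: last chunk of that best split

    for i in range(n):
        for c in chunks:
            start = i + 1 - len(c)
            # c must be a suffix of text[:i+1]
            if start < 0 or not text.startswith(c, start):
                continue
            new_val = 1 if start == 0 else min_values[start - 1] + 1
            if new_val < min_values[i]:
                min_values[i] = new_val
                used[i] = c

    if n == 0 or min_values[-1] == math.inf:
        return []  # assumption: an empty list signals "no valid split"

    # Trace back from the last character, collecting chunks in reverse order.
    result = []
    i = n - 1
    while i >= 0:
        result.append(used[i])
        i -= len(used[i])
    result.reverse()
    return result
```

For the choices ["0", "1", "10", "100"] and the text "10010", this sketch returns ['100', '10']. It checks every choice at every position, so it runs in O(n * c * lc) without any trie.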

Retrieving the choices that are suffixes can be sped up using a trie (of the reversed choice strings). The worst-case complexity will still be O(n * c * lc), where n is the length of the input string, c is the number of choices, and lc is the maximum length of the choices. However, this complexity only occurs for choices that are nested suffixes (e.g. 0, 10, 010, 0010, ...). In this case, the trie degenerates to a list. On average, the run time should be much lower. Under the assumption that the number of choices retrieved from the trie is always a small constant, it is O(n * lc) (actually, the lc factor is probably also smaller).
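The trie lookup could be sketched as follows. The helper names build_reversed_trie, suffix_choices and min_chunk_count are hypothetical, and the nested-dict trie with a None end marker is just one possible representation, assuming all choices are non-empty strings.

```python
import math

END = None  # marker key: a choice ends at this trie node


def build_reversed_trie(choices):
    """Insert every choice back to front into a nested-dict trie."""
    root = {}
    for c in choices:
        node = root
        for ch in reversed(c):
            node = node.setdefault(ch, {})
        node[END] = c
    return root


def suffix_choices(trie, text, i):
    """Yield every choice that is a suffix of text[:i+1], shortest first."""
    node = trie
    j = i
    # Walk backwards through the text while the trie has a matching edge.
    while j >= 0 and text[j] in node:
        node = node[text[j]]
        if END in node:
            yield node[END]
        j -= 1


def min_chunk_count(choices, text):
    """Minimum number of chunks for `text`, or math.inf if impossible."""
    trie = build_reversed_trie(choices)
    min_values = [math.inf] * len(text)
    for i in range(len(text)):
        for c in suffix_choices(trie, text, i):
            prev = i - len(c)
            new_val = 1 if prev < 0 else min_values[prev] + 1
            min_values[i] = min(min_values[i], new_val)
    return min_values[-1] if text else 0
```

The backward walk stops as soon as no choice can match, so each position costs at most lc steps plus one step per matching choice.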

Here is an example:

choices = ["0","1","10","100"]
text = "10010"

algorithm step    content of minValues
                   0      1       2        3      4
---------------------------------------------------------
initialize        (∞,     ∞ ,     ∞ ,      ∞ ,    ∞     )
i = 0, c = "1"    (1 "1", ∞ ,     ∞ ,      ∞ ,    ∞     )
i = 1, c = "0"    (1 "1", 2 "0",  ∞ ,      ∞ ,    ∞     )
i = 1, c = "10"   (1 "1", 1 "10", ∞ ,      ∞ ,    ∞     )
i = 2, c = "0"    (1 "1", 1 "10", 2 "0",   ∞ ,    ∞     )
i = 2, c = "100"  (1 "1", 1 "10", 1 "100", ∞ ,    ∞     )
i = 3, c = "1"    (1 "1", 1 "10", 1 "100", 2 "1", ∞     )
i = 4, c = "0"    (1 "1", 1 "10", 1 "100", 2 "1", 3 "0" )
i = 4, c = "10"   (1 "1", 1 "10", 1 "100", 2 "1", 2 "10")

Meaning: We can compose the string with 2 chunks. Tracing back gives the chunks in reverse order: "10", "100".
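To regenerate the table above, a small loop can record a snapshot of minValues after every matching choice. This is only an illustrative sketch: trace_min_values is a hypothetical helper, and it assumes the choices are tried in the order they are listed.

```python
import math


def trace_min_values(choices, text):
    """Return a list of (i, choice, snapshot) tuples, one per table row.

    Each snapshot is a list of (value, chunk) pairs, one per character.
    """
    n = len(text)
    min_values = [math.inf] * n
    used = [None] * n
    steps = []
    for i in range(n):
        for c in choices:
            start = i + 1 - len(c)
            # Skip empty choices and choices that are not a suffix of text[:i+1].
            if not c or start < 0 or not text.startswith(c, start):
                continue
            new_val = 1 if start == 0 else min_values[start - 1] + 1
            if new_val < min_values[i]:
                min_values[i] = new_val
                used[i] = c
            steps.append((i, c, list(zip(min_values, used))))
    return steps


if __name__ == "__main__":
    for i, c, snapshot in trace_min_values(["0", "1", "10", "100"], "10010"):
        print("i = {0}, c = {1!r}: {2}".format(i, c, snapshot))
```

Each printed line corresponds to one row of the table, without the initialization row.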
