Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

More elegant way to implement regexp-like quantifiers

I'm writing a simple string parser which allows regexp-like quantifiers. An input string might look like this:

s = "x y{1,2} z"

My parser function translates this string to a list of tuples:

list_of_tuples = [("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]

Now, the tricky bit is that I need a list of all valid combinations that are specified by the quantification. The combinations all have to have the same number of elements, and the value None is used for padding. For the given example, the expected output is

[["x", "y", None, "z"], ["x", "y", "y", "z"]]

I do have a working solution, but I'm not really happy with it: it uses two nested for loops, and I find the code somewhat obscure, so there's something generally awkward and clumsy about it:

import itertools

def permute_input(lot):
    outer = []
    # is there something that replaces these nested loops?
    for val, start, end in lot:
        inner = []
        # For each tuple, create a list of constant length
        # Each element contains a different number of 
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        for i in range(start, end + 1):
            x = [val] * i + [None] * (end - i)
            inner.append(x)
        outer.append(inner)
    # Outer is now a list of lists.

    final = []
    # use itertools.product to combine the elements in the
    # list of lists:
    for combination in itertools.product(*outer):
        # flatten the elements in the current combination,
        # and append them to the final list:
        final.append([x for x 
                    in itertools.chain.from_iterable(combination)])
    return final

print(permute_input([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

I suspect that there's a much more elegant way of doing this, possibly hidden somewhere in the itertools module?

like image 641
Schmuddi Avatar asked Jan 22 '17 11:01

Schmuddi


2 Answers

One alternative way to approach the problem is to use pyparsing and this example regex parser that would expand a regular expression to possible matching strings. For your x y{1,2} z sample string it would generate two possible strings expanding the quantifier:

$ python -i regex_invert.py 
>>> s = "x y{1,2} z"
>>> for item in invert(s):
...     print(item)
... 
x y z
x yy z

The repetition itself supports both an open-ended range and a closed range and is defined as:

repetition = (
    (lbrace + Word(nums).setResultsName("count") + rbrace) |
    (lbrace + Word(nums).setResultsName("minCount") + "," + Word(nums).setResultsName("maxCount") + rbrace) |
    oneOf(list("*+?"))
)

To get to the desired result, we should modify the way the results are yielded from the recurseList generator and return lists instead of strings:

for s in elist[0].makeGenerator()():
    for s2 in recurseList(elist[1:]):
        yield [s] + [s2]  # instead of yield s + s2

Then, we need to only flatten the result:

$ ipython3 -i regex_invert.py 

In [1]: import collections

In [2]: def flatten(l):
   ...:     for el in l:
   ...:         if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):
   ...:             yield from flatten(el)
   ...:         else:
   ...:             yield el
   ...:             

In [3]: s = "x y{1,2} z"

In [4]: for option in invert(s):
   ...:     print(list(flatten(option)))
   ...: 
['x', ' ', 'y', None, ' ', 'z']
['x', ' ', 'y', 'y', ' ', 'z']

Then, if needed, you can filter the whitespace characters:

In [5]: for option in invert(s):
   ...:     print([item for item in flatten(option) if item != ' '])
   ...:     
['x', 'y', None, 'z']
['x', 'y', 'y', 'z']
like image 68
alecxe Avatar answered Oct 06 '22 08:10

alecxe


Recursive solution (simple, good for up to few thousand tuples):

def permutations(lot):
    if not lot:
        yield []
    else:
        item, start, end = lot[0]
        for prefix_length in range(start, end+1):
            for perm in permutations(lot[1:]):
                yield [item]*prefix_length + [None] * (end - prefix_length) + perm

It is limited by the recursion depth (~1000). If it is not enough, there is a simple optimization for start == end cases. Dependin on the expected size of list_of_tuples it might be enough

Test:

>>> list(permutations(list_of_tuples))  # list() because it's an iterator
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

Without recursion (universal but less elegant):

def permutations(lot):
    source = []
    cnum = 1  # number of possible combinations
    for item, start, end in lot:  # create full list without Nones
        source += [item] * (end-start+1)
        cnum *= (end-start+1)

    for i in range(cnum):
        bitmask = [True] * len(source)
        state = i
        pos = 0
        for _, start, end in lot:
            state, m = divmod(state, end-start+1)  # m - number of Nones to insert
            pos += end-start+1
            bitmask[pos-m:pos] = [None] * m
        yield [bitmask[i] and c for i, c in enumerate(source)]

The idea behind this solution: actually, we are kind of looking full string (xyyz) though a glass wich adds certain number of None. We can count numer of possible combinations by calculating product of all (end-start+1). Then, we can just number all iterations (simple range loop) and reconstruct this mask from the iteration number. Here we reconstruct the mask by iteratively using divmod on the state number and using remainder as the number of Nones at the symbol position

like image 34
Marat Avatar answered Oct 06 '22 07:10

Marat