
Itertools zip_longest with the first item of each sub-list as the padding value instead of the default None

I have this list of lists:

cont_det = [['TASU 117000 0', "TGHU 759933 - 0", 'CSQU3054383', 'BMOU 126 780-0', "HALU 2014 13 3"], ['40HS'], ['Ha2ardous Materials', 'Arm5 Maehinery']]

In practice, cont_det is a huge list containing many sub-lists of irregular length; this is just a small sample for demonstration. I want to get the following output:

[['TASU 117000 0', '40HS', 'Ha2ardous Materials'], 
 ['TGHU 759933 - 0', '40HS', 'Arm5 Maehinery'], 
 ['CSQU3054383', '40HS', 'Ha2ardous Materials'], 
 ['BMOU 126 780-0', '40HS', 'Ha2ardous Materials'], 
 ['HALU 2014 13 3', '40HS', 'Ha2ardous Materials']]

The logic is to zip_longest the list of lists, but whenever a sub-list is shorter than the longest sub-list (length 5 here, the first sub-list), the missing positions are filled with the first item of that sub-list instead of the default fillvalue=None. In the second sub-list every filled value is therefore the same, and in the third sub-list the last three values are filled with its first item.

I got the result with this code:

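from itertools import zip_longest as zilo
from more_itertools import padded as pad

# length of the longest sub-list
max_ = len(max(cont_det, key=len))
for i, cont_row in enumerate(cont_det):
    # pad shorter sub-lists with their own first item (compare the row's length, not the outer list's)
    if len(cont_row) != max_:
        cont_det[i] = list(pad(cont_row, cont_row[0], max_))
# every sub-list now has the same length, so zip_longest never needs its fillvalue
cont_det = list(map(list, zilo(*cont_det)))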

This gives me the expected result. Had I instead used list(zilo(*cont_det, fillvalue='')), I would have gotten this:

[('TASU 117000 0', '40HS', 'Ha2ardous Materials'), 
 ('TGHU 759933 - 0', '', 'Arm5 Maehinery'), 
 ('CSQU3054383', '', ''), 
 ('BMOU 126 780-0', '', ''), 
 ('HALU 2014 13 3', '', '')]

Is there any other way (such as passing some function to the fillvalue parameter of zip_longest) so that I don't have to iterate through the list and pad each sub-list to the length of the longest one beforehand, and the whole thing can be done in one line with only zip_longest?

Arkistarvh Kltzuonstev asked Dec 10 '19 21:12



2 Answers

You can peek into each of the iterators via next in order to extract the first item ("head"), then create a sentinel object that marks the end of the iterator and finally chain everything back together in the following way: head -> remainder_of_iterator -> sentinel -> it.repeat(head).
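As a small illustration of that chain layout, the sketch below builds it for a single row and previews the first few items. The variable names (row, chained, preview) and the '<sentinel>' display string are only for illustration and are not part of the answer's function.

```python
import itertools as it

row = ['Ha2ardous Materials', 'Arm5 Maehinery']
iterator = iter(row)
head = next(iterator)          # peek at the first item; the iterator now continues from the second item
sentinel = object()            # unique marker for "the original data ended here"

# head -> remainder_of_iterator -> sentinel -> it.repeat(head)
chained = it.chain((head,), iterator, (sentinel,), it.repeat(head))

# The chain never ends, so only look at the first five items.
preview = ['<sentinel>' if x is sentinel else x for x in it.islice(chained, 5)]
print(preview)
# ['Ha2ardous Materials', 'Arm5 Maehinery', '<sentinel>', 'Ha2ardous Materials', 'Ha2ardous Materials']
```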

This uses it.repeat to replay the first item ad infinitum once the end of the iterator has been reached, so we need to introduce a way to stop that process once the last iterator hits its sentinel object. For this we can (ab)use the fact that map stops iterating if the mapped function raises (or leaks) a StopIteration, such as from next invoked on an already exhausted iterator. Alternatively we can use the 2-argument form of iter to stop on a sentinel object (see below).
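Both stopping mechanisms can be tried in isolation. The following is a minimal sketch under the assumption of CPython's usual iterator behaviour; budget, gate and source are hypothetical names used only for this demonstration.

```python
import itertools as it

budget = it.repeat(None, 2)          # allows exactly two successful next() calls


def gate(x):
    next(budget)                     # the third call raises StopIteration
    return x


# The StopIteration propagates out of the map iterator, and list() treats it as normal exhaustion.
stopped_by_exception = list(map(gate, it.count()))
print(stopped_by_exception)          # [0, 1]

# Two-argument iter: keep calling the function until it returns the sentinel.
sentinel = object()
source = iter([1, 2, sentinel, 3])
stopped_by_sentinel = list(iter(source.__next__, sentinel))
print(stopped_by_sentinel)           # [1, 2]
```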

So we can map a function over the chained iterators that checks, for each item, whether it is the sentinel and performs the following steps:

  1. if the item is the sentinel, call next on a dedicated iterator that yields one item fewer than the total number of iterators (so it leaks StopIteration for the last sentinel), and replace the sentinel with the corresponding head.
  2. otherwise just return the original item.

Finally we can just zip the iterators together - it will stop on the last one hitting its sentinel object, i.e. performing a "zip-longest".

In summary, the following function performs the steps described above:

import itertools as it


def solution(*iterables):
    iterators = [iter(i) for i in iterables]  # make sure we're operating on iterators
    heads = [next(i) for i in iterators]  # requires each of the iterables to be non-empty
    sentinel = object()
    iterators = [it.chain((head,), iterator, (sentinel,), it.repeat(head))
                 for iterator, head in zip(iterators, heads)]
    # Create a dedicated iterator object that will be consumed each time a 'sentinel' object is found.
    # For the sentinel corresponding to the last iterator in 'iterators' this will leak a StopIteration.
    running = it.repeat(None, len(iterators) - 1)
    iterators = [map(lambda x, h: next(running) or h if x is sentinel else x,  # StopIteration causes the map to stop iterating
                     iterator, it.repeat(head))
                 for iterator, head in zip(iterators, heads)]
    return zip(*iterators)

If leaking StopIteration from the mapped function in order to terminate the map iterator feels too awkward then we can slightly modify the definition of running to yield an additional sentinel and use the 2-argument form of iter in order to stop on sentinel:

running = it.chain(it.repeat(None, len(iterators) - 1), (sentinel,))
iterators = [...]  # here the conversion to map objects remains unchanged
return zip(*[iter(i.__next__, sentinel) for i in iterators])
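Put together, the sentinel-based variant might look like the sketch below. It only assembles the pieces shown above into one function; the name solution_with_sentinel is made up here, and like the original it assumes every input iterable is non-empty.

```python
import itertools as it


def solution_with_sentinel(*iterables):
    iterators = [iter(i) for i in iterables]
    heads = [next(i) for i in iterators]      # each iterable must be non-empty
    sentinel = object()
    iterators = [it.chain((head,), iterator, (sentinel,), it.repeat(head))
                 for iterator, head in zip(iterators, heads)]
    # n - 1 times None, then the sentinel itself for the last iterable to run out.
    running = it.chain(it.repeat(None, len(iterators) - 1), (sentinel,))
    iterators = [map(lambda x, h: next(running) or h if x is sentinel else x,
                     iterator, it.repeat(head))
                 for iterator, head in zip(iterators, heads)]
    # Two-argument iter stops each stream as soon as the sentinel comes through.
    return zip(*[iter(i.__next__, sentinel) for i in iterators])


cont_det = [
    ['TASU 117000 0', 'TGHU 759933 - 0', 'CSQU3054383', 'BMOU 126 780-0', 'HALU 2014 13 3'],
    ['40HS'],
    ['Ha2ardous Materials', 'Arm5 Maehinery'],
]
result = [list(row) for row in solution_with_sentinel(*cont_det)]
print(result)
```

For the sample data this should print the five rows requested in the question.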

If the name resolution for sentinel and running from inside the mapped function is a concern, they can be included as arguments to that function:

iterators = [map(lambda x, h, s, r: next(r) or h if x is s else x,
                 iterator, it.repeat(head), it.repeat(sentinel), it.repeat(running))
             for iterator, head in zip(iterators, heads)]
a_guest answered Nov 12 '22 20:11



That looks like some sort of "matrix rotation".

I did it without any libraries to keep it clear for everybody. I find it pretty easy to follow.

from pprint import pprint

cont_det = [
    ['TASU 117000 0', "TGHU 759933 - 0", 'CSQU3054383', 'BMOU 126 780-0', "HALU 2014 13 3"],
    ['40HS'],
    ['Ha2ardous Materials', 'Arm5 Maehinery'],
]


def rotate_matrix(source):
    result = []

    # let's find the longest sub-list length
    length = max((len(row) for row in source))

    # for every column in sub-lists create a new row in the resulting list
    for column_id in range(0, length):
        result.append([])

        # let's fill the new created row using source row columns data.
        for row_id in range(0, len(source)):
            # let's use the first value from the sublist values if source row list has it for the column_id
            if len(source[row_id]) > column_id:
                result[column_id].append(source[row_id][column_id])
            else:
                try:
                    result[column_id].append(source[row_id][0])
                except IndexError:
                    result[column_id].append(None)

    return result


pprint(rotate_matrix(cont_det))

And, of course, the script output:


> python test123.py
[['TASU 117000 0', '40HS', 'Ha2ardous Materials'],
 ['TGHU 759933 - 0', '40HS', 'Arm5 Maehinery'],
 ['CSQU3054383', '40HS', 'Ha2ardous Materials'],
 ['BMOU 126 780-0', '40HS', 'Ha2ardous Materials'],
 ['HALU 2014 13 3', '40HS', 'Ha2ardous Materials']]

I don't fully understand the zip_longest part. Is zip_longest a requirement for the solution, or do you just need a solution that works? :) As far as I can tell, zip_longest does not support any kind of callback that could return the required value per cell of the matrix.
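For comparison, the per-row padding can also be expressed lazily with plain zip, chain and repeat, without zip_longest. This is only a sketch and is not taken from either answer; it assumes every sub-list is a non-empty sequence that supports len, and the names longest and result are illustrative.

```python
from itertools import chain, repeat

cont_det = [
    ['TASU 117000 0', 'TGHU 759933 - 0', 'CSQU3054383', 'BMOU 126 780-0', 'HALU 2014 13 3'],
    ['40HS'],
    ['Ha2ardous Materials', 'Arm5 Maehinery'],
]

longest = max(map(len, cont_det))

# Extend each row with as many copies of its first item as it is missing, then transpose.
result = [
    list(column)
    for column in zip(*(chain(row, repeat(row[0], longest - len(row))) for row in cont_det))
]
print(result)
```

Because every padded row has exactly the same length, plain zip is enough and no fillvalue is ever needed.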

Alexandr Shurigin answered Nov 12 '22 20:11
