Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

zip_longest for the left list always

I know about the zip function (which will zip according to the shortest list) and zip_longest (which will zip according to the longest list), but how would I zip according to the first list, regardless of whether it's the longest or not?

For example:

Input:  ['a', 'b', 'c'], [1, 2]
Output: [('a', 1), ('b', 2), ('c', None)]

But also:

Input:  ['a', 'b'], [1, 2, 3]
Output: [('a', 1), ('b', 2)]

Do both of these functionalities exist in one function?

like image 942
NFeruch - FreePalestine Avatar asked Aug 31 '25 10:08

NFeruch - FreePalestine


2 Answers

Solutions

Chaining the repeated fillvalue behind the iterables other than the first:

from itertools import chain, repeat

def zip_first(first, *rest, fillvalue=None):
    return zip(first, *map(chain, rest, repeat(repeat(fillvalue))))

Or using zip_longest and trim it with a compress and zip trick:

def zip_first(first, *rest, fillvalue=None):
    a, b = tee(first)
    return compress(zip_longest(b, *rest, fillvalue=fillvalue), zip(a))

Just like zip and zip_longest, these take any number (well, at least one) of any kind of iterables (including infinite ones) and return an iterator (convert to list if needed).

Benchmark results

Benchmarks with other equally general solutions (all code is at the end of the answer):

10 iterables of 10,000 to 90,000 elements, first has 50,000:
────────────────────────────────────────────────────────────
 2.2 ms   2.2 ms   2.3 ms  limit_cheat
 2.6 ms   2.6 ms   2.6 ms  Kelly_Bundy_chain
 3.3 ms   3.3 ms   3.3 ms  Kelly_Bundy_compress
50.2 ms  50.6 ms  50.7 ms  CrazyChucky
54.7 ms  55.0 ms  55.0 ms  Sven_Marnach
74.8 ms  74.9 ms  75.0 ms  Mad_Physicist
 5.4 ms   5.4 ms   5.4 ms  Kelly_Bundy_3
 5.9 ms   6.0 ms   6.0 ms  Kelly_Bundy_4
 4.6 ms   4.7 ms   4.7 ms  Kelly_Bundy_5

10,000 iterables of 0 to 100 elements, first has 50:
────────────────────────────────────────────────────
 4.6 ms   4.7 ms   4.8 ms  limit_cheat
 4.8 ms   4.8 ms   4.8 ms  Kelly_Bundy_compress
 8.4 ms   8.4 ms   8.4 ms  Kelly_Bundy_chain
27.1 ms  27.3 ms  27.5 ms  CrazyChucky
38.3 ms  38.5 ms  38.7 ms  Sven_Marnach
73.0 ms  73.0 ms  73.1 ms  Mad_Physicist
 4.9 ms   4.9 ms   5.0 ms  Kelly_Bundy_3
 4.9 ms   4.9 ms   5.0 ms  Kelly_Bundy_4
 5.0 ms   5.0 ms   5.0 ms  Kelly_Bundy_5

The first one is a cheat that knows the length, included to show what's probably a limit for how fast we can get.

Explanations

A little explanation of the above two solutions:

The first solution, if used with for example three iterables, is equivalent to this:

def zip_first(first, second, third, fillvalue=None):
    filler = repeat(fillvalue)
    return zip(first,
               chain(second, filler),
               chain(third, filler))

The second solution basically lets zip_longest do the job. The only problem with that is that it doesn't stop when the first iterable is done. So I duplicate the first iterable (with tee) and then use one for its elements and the other for its length. The zip(a) wraps every element in a 1-tuple, and non-empty tuples are true. So compress gives me all tuples produced by zip_longest, as many as there are elements in the first iterable.

Benchmark code (Try it online!)

def limit_cheat(*iterables, fillvalue=None):
    return islice(zip_longest(*iterables, fillvalue=fillvalue), cheat_length)

def Kelly_Bundy_chain(first, *rest, fillvalue=None):
    return zip(first, *map(chain, rest, repeat(repeat(fillvalue))))

def Kelly_Bundy_compress(first, *rest, fillvalue=None):
    a, b = tee(first)
    return compress(zip_longest(b, *rest, fillvalue=fillvalue), zip(a))

def CrazyChucky(*iterables, fillvalue=None):
    SENTINEL = object()
    
    for first, *others in zip_longest(*iterables, fillvalue=SENTINEL):
        if first is SENTINEL:
            return
        others = [i if i is not SENTINEL else fillvalue for i in others]
        yield (first, *others)

def Sven_Marnach(first, *rest, fillvalue=None):
    rest = [iter(r) for r in rest]
    for x in first:
        yield x, *(next(r, fillvalue) for r in rest)

def Mad_Physicist(*args, fillvalue=None):
    # zip_by_first('ABCD', 'xy', fillvalue='-') --> Ax By C- D-
    # zip_by_first('ABC', 'xyzw', fillvalue='-') --> Ax By Cz
    if not args:
        return
    iterators = [iter(it) for it in args]
    while True:
        values = []
        for i, it in enumerate(iterators):
            try:
                value = next(it)
            except StopIteration:
                if i == 0:
                    return
                iterators[i] = repeat(fillvalue)
                value = fillvalue
            values.append(value)
        yield tuple(values)

def Kelly_Bundy_3(first, *rest, fillvalue=None):
    a, b = tee(first)
    return map(itemgetter(1), zip(a, zip_longest(b, *rest, fillvalue=fillvalue)))

def Kelly_Bundy_4(first, *rest, fillvalue=None):
    sentinel = object()
    for z in zip_longest(chain(first, [sentinel]), *rest, fillvalue=fillvalue):
        if z[0] is sentinel:
            break
        yield z

def Kelly_Bundy_5(first, *rest, fillvalue=None):
    stopped = False
    def stop():
        nonlocal stopped
        stopped = True
        return
        yield
    for z in zip_longest(chain(first, stop()), *rest, fillvalue=fillvalue):
        if stopped:
            break
        yield z


import timeit
from itertools import chain, repeat, zip_longest, islice, tee, compress
from operator import itemgetter
from collections import deque

funcs = [
    limit_cheat,
    Kelly_Bundy_chain,
    Kelly_Bundy_compress,
    CrazyChucky,
    Sven_Marnach,
    Mad_Physicist,
    Kelly_Bundy_3,
    Kelly_Bundy_4,
    Kelly_Bundy_5,
]

def test(args_creator):

    # Correctness
    expect = list(funcs[0](*args_creator()))
    for func in funcs:
        result = list(func(*args_creator()))
        print(result == expect, func.__name__)
    
    # Speed
    tss = [[] for _ in funcs]
    for _ in range(5):
        print()
        print(args_creator.__name__)
        for func, ts in zip(funcs, tss):
            t = min(timeit.repeat(lambda: deque(func(*args_creator()), 0), number=1))
            ts.append(t)
            print(*('%4.1f ms ' % (t * 1e3) for t in sorted(ts)[:3]), func.__name__)

def args_few_but_long_iterables():
    global cheat_length
    cheat_length = 50_000
    first = repeat(0, 50_000)
    rest = [repeat(i, 10_000 * i) for i in range(1, 10)]
    return first, *rest

def args_many_but_short_iterables():
    global cheat_length
    cheat_length = 50
    first = repeat(0, 50)
    rest = [repeat(i, i % 101) for i in range(1, 10_000)]
    return first, *rest

test(args_few_but_long_iterables)
funcs[1:3] = funcs[1:3][::-1]
test(args_many_but_short_iterables)
like image 78
Kelly Bundy Avatar answered Sep 03 '25 01:09

Kelly Bundy


You can repurpose the "roughly equivalent" python code shown in the docs for itertools.zip_longest to make a generalized version that zips according to the length of the first argument:

from itertools import repeat

def zip_by_first(*args, fillvalue=None):
    # zip_by_first('ABCD', 'xy', fillvalue='-') --> Ax By C- D-
    # zip_by_first('ABC', 'xyzw', fillvalue='-') --> Ax By Cz
    if not args:
        return
    iterators = [iter(it) for it in args]
    while True:
        values = []
        for i, it in enumerate(iterators):
            try:
                value = next(it)
            except StopIteration:
                if i == 0:
                    return
                iterators[i] = repeat(fillvalue)
                value = fillvalue
            values.append(value)
        yield tuple(values)

You might be able to make some small improvements like caching repeat(fillvalue) or so. The issue with this implementation is that it's written in Python, while most of itertools uses a much faster C implementation. You can see the effects of this by comparing against Kelly Bundy's answer.

like image 36
Mad Physicist Avatar answered Sep 03 '25 00:09

Mad Physicist