Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there an alternative for zip(*iterable) when the iterable consists of millions of elements?

I have come across a code like this:

from random import randint

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = [Point(randint(1, 10), randint(1, 10)) for _ in range(10)]
xs = [point.x for point in points]
ys = [point.y for point in points]

And I think this code is not Pythonic because it repeats itself. If another dimension is added to Point class, a whole new loop needs to be written like:

zs = [point.z for point in points]

So I tried to make it more Pythonic by writing something like this:

xs, ys = zip(*[(point.x, point.y) for point in p])

If a new dimension is added, no problem:

xs, ys, zs = zip(*[(point.x, point.y, point.z) for point in p])

But this is almost 10 times slower than the other solution when there are millions of points, although it has only one loop. I think it is because * operator needs to unpack millions of arguments to the zip function which is horrible. So my question is:

Is there a way to change the code above so that it is as fast as before and Pythonic (without using 3rd party libraries)?

like image 926
Asocia Avatar asked Aug 17 '20 11:08

Asocia


People also ask

What does zip (* list do in Python?

Python's zip() function is defined as zip(*iterables) . The function takes in iterables as arguments and returns an iterator. This iterator generates a series of tuples containing elements from each iterable. zip() can accept any type of iterable, such as files, lists, tuples, dictionaries, sets, and so on.

What is the role of the * operator within the zip () function?

Basically, it passes the contents of the lists as arguments.

Is a zip object iterable?

Zip is a useful in-built function provided by Python to loop through multiple iterables. Zip takes any number of iterables and returns a zip object which is an iterator of tuples. Each ith tuple inside the zip object has ith element from each iterable. We saw multiple examples of using the zip function.

Why do you use the zip () method in Python?

The zip() function returns a zip object, which is an iterator of tuples where the first item in each passed iterator is paired together, and then the second item in each passed iterator are paired together etc.

Why is my ZIP iterable not the same length?

When using zip (), it is important to pay attention to the length iterables. It’s possible that the iterables we pass in as arguments aren’t the same length. In these cases, the number of elements that zip () puts out will be equal to the length of the shortest iterable.

How many iterables are there in Python ZIP ()?

Here, you call the Python zip () function with three iterables, so the resulting tuples have three elements each. When you’re working with the Python zip () function, it’s important to pay attention to the length of your iterables. It’s possible that the iterables you pass in as arguments aren’t the same length.

Why does zip () produce a list of 5 tuples?

Since 5 is the length of the first (and shortest) range () object, zip () outputs a list of five tuples. There are still 95 unmatched elements from the second range () object. These are all ignored by zip () since there are no more elements from the first range () object to complete the pairs.

How to iterate through multiple lists in parallel using Python's zip () function?

Let's now see how Python's zip () function can help us iterate through multiple lists in parallel. Read ahead to find out. Let's start by looking up the documentation for zip () and parse it in the subsequent sections. Syntax: zip (*iterables) – the zip () function takes in one or more iterables as arguments.


1 Answers

I just tested several ways of zipping Point coordinates and looked for their performance with increasing number of points.

Below are the functions I used to test:

def hardcode(points):
    # a hand crafted comprehension for each coordinate
    return [point.x for point in points], [point.y for point in points]


def using_zip(points):
    # using the "problematic" qip function
    return zip(*((point.x, point.y) for point in points))


def loop_and_comprehension(points):
    # making comprehension from a list of coordinate names
    zipped = []
    for coordinate in ('x', 'y'):
        zipped.append([getattr(point, coordinate) for point in points])
    return zipped


def nested_comprehension(points):
    # making comprehension from a list of coordinate names using nested
    # comprehensions
    return [
        [getattr(point, coordinate) for point in points]
        for coordinate in ('x', 'y')
    ]

Using timeit I timed execution of each function with different number of points and here are the results:

comparing processing times using 10 points and 10000000 iterations
hardcode................. 14.12024447 [+0%]
using_zip................ 16.84289724 [+19%]
loop_and_comprehension... 30.83631476 [+118%]
nested_comprehension..... 30.45758349 [+116%]

comparing processing times using 100 points and 1000000 iterations
hardcode................. 9.30594717 [+0%]
using_zip................ 13.74953714 [+48%]
loop_and_comprehension... 19.46766583 [+109%]
nested_comprehension..... 19.27818860 [+107%]

comparing processing times using 1000 points and 100000 iterations
hardcode................. 7.90372457 [+0%]
using_zip................ 12.51523594 [+58%]
loop_and_comprehension... 18.25679913 [+131%]
nested_comprehension..... 18.64352790 [+136%]

comparing processing times using 10000 points and 10000 iterations
hardcode................. 8.27348382 [+0%]
using_zip................ 18.23079485 [+120%]
loop_and_comprehension... 18.00183383 [+118%]
nested_comprehension..... 17.96230063 [+117%]

comparing processing times using 100000 points and 1000 iterations
hardcode................. 9.15848662 [+0%]
using_zip................ 22.70730675 [+148%]
loop_and_comprehension... 17.81126971 [+94%]
nested_comprehension..... 17.86892597 [+95%]

comparing processing times using 1000000 points and 100 iterations
hardcode................. 9.75002857 [+0%]
using_zip................ 23.13891725 [+137%]
loop_and_comprehension... 18.08724660 [+86%]
nested_comprehension..... 18.01269820 [+85%]

comparing processing times using 10000000 points and 10 iterations
hardcode................. 9.96045920 [+0%]
using_zip................ 23.11653558 [+132%]
loop_and_comprehension... 17.98296033 [+81%]
nested_comprehension..... 18.17317708 [+82%]

comparing processing times using 100000000 points and 1 iterations
hardcode................. 64.58698246 [+0%]
using_zip................ 92.53437881 [+43%]
loop_and_comprehension... 73.62493845 [+14%]
nested_comprehension..... 62.99444739 [-2%]

We can see that the gap between the "harcoded" solution and the solutions with comprehensions built with gettattr seems to constantly reduce as the number of points grows.

So, for a very big number of points it could be a good idea to use generated comprehensions from a list of coordinates:

[[getattr(point, coordinate) for point in points]
 for coordinate in ('x', 'y')]

However, for a small number of points this is the worst solution (from the ones I tested at least).


For information, here is the code I used to run this benchmark:

import timeit


...


def compare(nb_points, nb_iterations):
    reference = None
    points = [Point(randint(1, 100), randint(1, 100))
              for _ in range(nb_points)]
    print("comparing processing times using {} points and {} iterations"
          .format(nb_points, nb_iterations))

    for func in (hardcode, using_zip, loop_and_comprehension, nested_comprehension):
        duration = timeit.timeit(lambda: func(points), number=nb_iterations)

        print('{:.<25} {:0=2.8f} [{:0>+.0%}]'
              .format(func.__name__, duration,
                      0 if reference is None else (duration / reference - 1)))

        if reference is None:
            reference = duration

    print("-" * 80)



compare(10, 10000000)
compare(100, 1000000)
compare(1000, 100000)
compare(10000, 10000)
compare(100000, 1000)
compare(1000000, 100)
compare(10000000, 10)
compare(100000000, 1)
like image 58
Tryph Avatar answered Sep 27 '22 23:09

Tryph