Is there an alternative for zip(*iterable) when the iterable consists of millions of elements?

I came across code like this:

from random import randint

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = [Point(randint(1, 10), randint(1, 10)) for _ in range(10)]
xs = [point.x for point in points]
ys = [point.y for point in points]

I think this code is not Pythonic because it repeats itself. If another dimension is added to the Point class, a whole new loop needs to be written:

zs = [point.z for point in points]

So I tried to make it more Pythonic by writing something like this:

xs, ys = zip(*[(point.x, point.y) for point in points])

If a new dimension is added, no problem:

xs, ys, zs = zip(*[(point.x, point.y, point.z) for point in points])

But this is almost 10 times slower than the other solution when there are millions of points, even though it has only one loop. I think this is because the * operator has to unpack millions of arguments into the zip call, which is horrible. So my question is:

Is there a way to change the code above so that it is as fast as before and Pythonic (without using 3rd party libraries)?

asked Aug 17 '20 11:08

Asocia

1 Answer

I tested several ways of zipping Point coordinates and measured their performance with an increasing number of points.

Below are the functions I used to test:

def hardcode(points):
    # a hand-crafted comprehension for each coordinate
    return [point.x for point in points], [point.y for point in points]


def using_zip(points):
    # using the "problematic" zip function
    return zip(*((point.x, point.y) for point in points))


def loop_and_comprehension(points):
    # making comprehension from a list of coordinate names
    zipped = []
    for coordinate in ('x', 'y'):
        zipped.append([getattr(point, coordinate) for point in points])
    return zipped


def nested_comprehension(points):
    # making comprehension from a list of coordinate names using nested
    # comprehensions
    return [
        [getattr(point, coordinate) for point in points]
        for coordinate in ('x', 'y')
    ]
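One caveat worth keeping in mind when reading the timings: in Python 3, zip returns a lazy iterator, so using_zip unpacks its arguments but never builds the output tuples, whereas the other three functions return fully built lists. A minimal sketch of a variant that materializes the result, which may give a fairer comparison, could look like this (the name using_zip_materialized is hypothetical and is not part of the benchmark; the Point class is repeated only to keep the example self-contained):

```python
class Point:
    # same shape as the Point class from the question
    def __init__(self, x, y):
        self.x = x
        self.y = y


def using_zip_materialized(points):
    # zip(*...) still receives one argument per point, but wrapping each
    # output tuple in list() forces the transposition to actually happen
    return [list(column) for column in zip(*((p.x, p.y) for p in points))]


points = [Point(i, i * i) for i in range(5)]
xs, ys = using_zip_materialized(points)
print(xs)  # [0, 1, 2, 3, 4]
print(ys)  # [0, 1, 4, 9, 16]
```

Note that with an empty list of points zip() yields nothing, so unpacking into xs, ys would fail; the comprehension-based functions do not have that problem.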

Using timeit, I timed each function with different numbers of points. Here are the results (total time in seconds; the percentage is relative to hardcode):

comparing processing times using 10 points and 10000000 iterations
hardcode................. 14.12024447 [+0%]
using_zip................ 16.84289724 [+19%]
loop_and_comprehension... 30.83631476 [+118%]
nested_comprehension..... 30.45758349 [+116%]

comparing processing times using 100 points and 1000000 iterations
hardcode................. 9.30594717 [+0%]
using_zip................ 13.74953714 [+48%]
loop_and_comprehension... 19.46766583 [+109%]
nested_comprehension..... 19.27818860 [+107%]

comparing processing times using 1000 points and 100000 iterations
hardcode................. 7.90372457 [+0%]
using_zip................ 12.51523594 [+58%]
loop_and_comprehension... 18.25679913 [+131%]
nested_comprehension..... 18.64352790 [+136%]

comparing processing times using 10000 points and 10000 iterations
hardcode................. 8.27348382 [+0%]
using_zip................ 18.23079485 [+120%]
loop_and_comprehension... 18.00183383 [+118%]
nested_comprehension..... 17.96230063 [+117%]

comparing processing times using 100000 points and 1000 iterations
hardcode................. 9.15848662 [+0%]
using_zip................ 22.70730675 [+148%]
loop_and_comprehension... 17.81126971 [+94%]
nested_comprehension..... 17.86892597 [+95%]

comparing processing times using 1000000 points and 100 iterations
hardcode................. 9.75002857 [+0%]
using_zip................ 23.13891725 [+137%]
loop_and_comprehension... 18.08724660 [+86%]
nested_comprehension..... 18.01269820 [+85%]

comparing processing times using 10000000 points and 10 iterations
hardcode................. 9.96045920 [+0%]
using_zip................ 23.11653558 [+132%]
loop_and_comprehension... 17.98296033 [+81%]
nested_comprehension..... 18.17317708 [+82%]

comparing processing times using 100000000 points and 1 iterations
hardcode................. 64.58698246 [+0%]
using_zip................ 92.53437881 [+43%]
loop_and_comprehension... 73.62493845 [+14%]
nested_comprehension..... 62.99444739 [-2%]

We can see that the gap between the "hardcoded" solution and the comprehension-based solutions built with getattr seems to shrink steadily as the number of points grows.

So, for a very large number of points, it could be a good idea to generate the comprehensions from a list of coordinate names:

[[getattr(point, coordinate) for point in points]
 for coordinate in ('x', 'y')]

However, for a small number of points this is the worst solution (from the ones I tested at least).
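If the list of coordinates is likely to change, the same idea can be wrapped in a small helper. The sketch below is an illustration only and was not part of the benchmark: the name split_coordinates is hypothetical, the three-dimensional Point is assumed for the example, and operator.attrgetter from the standard library is swapped in for getattr.

```python
from operator import attrgetter


class Point:
    # assumed 3D variant of the Point class from the question
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


def split_coordinates(points, names):
    # build one list per attribute name, in the order the names are given
    return [list(map(attrgetter(name), points)) for name in names]


points = [Point(1, 2, 3), Point(4, 5, 6)]
xs, ys, zs = split_coordinates(points, ('x', 'y', 'z'))
print(xs, ys, zs)  # [1, 4] [2, 5] [3, 6]
```

Adding a dimension then only means adding a name to the tuple, which addresses the repetition concern from the question while still running one pass over the points per coordinate.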

For information, here is the code I used to run this benchmark:

import timeit


...


def compare(nb_points, nb_iterations):
    reference = None
    points = [Point(randint(1, 100), randint(1, 100))
              for _ in range(nb_points)]
    print("comparing processing times using {} points and {} iterations"
          .format(nb_points, nb_iterations))

    for func in (hardcode, using_zip, loop_and_comprehension, nested_comprehension):
        duration = timeit.timeit(lambda: func(points), number=nb_iterations)

        print('{:.<25} {:0=2.8f} [{:0>+.0%}]'
              .format(func.__name__, duration,
                      0 if reference is None else (duration / reference - 1)))

        if reference is None:
            reference = duration

    print("-" * 80)



compare(10, 10000000)
compare(100, 1000000)
compare(1000, 100000)
compare(10000, 10000)
compare(100000, 1000)
compare(1000000, 100)
compare(10000000, 10)
compare(100000000, 1)
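Single timeit runs can be noisy, so a possible refinement is to repeat each measurement and keep the minimum. This is a sketch under assumptions: best_time is a hypothetical helper, the point count and repeat counts are arbitrary, and only two of the functions are redefined here to keep the example self-contained.

```python
import timeit
from random import randint


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def hardcode(points):
    return [point.x for point in points], [point.y for point in points]


def nested_comprehension(points):
    return [
        [getattr(point, coordinate) for point in points]
        for coordinate in ('x', 'y')
    ]


def best_time(func, points, number=100, repeat=5):
    # timeit.repeat returns one total duration per repetition;
    # the minimum is the run least disturbed by other activity
    return min(timeit.repeat(lambda: func(points), number=number, repeat=repeat))


points = [Point(randint(1, 100), randint(1, 100)) for _ in range(1000)]
results = {func.__name__: best_time(func, points)
           for func in (hardcode, nested_comprehension)}
for name, duration in results.items():
    print('{:.<25} {:.8f}'.format(name, duration))
```

The absolute numbers will differ from the table above, since the machine, Python version, and iteration counts are not the same.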

answered Sep 27 '22 23:09

Tryph

Tags: python, optimization, python-3.x, iterable-unpacking
