Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series

There are many questions (1, 2, 3) dealing with counting values in a single series.

However, there are fewer questions looking at the best way to count combinations of two or more series. Solutions are presented (1, 2), but when and why one should use each is not discussed.

Below is some benchmarking for three potential methods. I have two specific questions:

  1. Why is grouper more efficient than count? I expected count to be more efficient, as it is implemented in C. The superior performance of grouper persists even if the number of columns is increased from 2 to 4.
  2. Why does value_counter underperform grouper by so much? Is this due to the cost of constructing a list, or of constructing a series from that list?

I understand the outputs are different, and this should also inform the choice. For example, filtering by count is more efficient with contiguous numpy arrays than with a dictionary comprehension:

x, z = grouper(df), count(df)
%timeit x[x.values > 10]                        # 749µs
%timeit {k: v for k, v in z.items() if v > 10}  # 9.37ms

However, the focus of my question is on the performance of building comparable results in a series versus a dictionary. My C knowledge is limited, but I would appreciate any answer that points to the logic underlying these methods.

Benchmarking code

import pandas as pd
import numpy as np
from collections import Counter

np.random.seed(0)

m, n = 1000, 100000

df = pd.DataFrame({'A': np.random.randint(0, m, n),
                   'B': np.random.randint(0, m, n)})

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def value_counter(df):
    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)

def count(df):
    return Counter(zip(df.A.values, df.B.values))

x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)

assert (x == y) & (y == z), "Dictionary mismatch!"

for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit value_counter(df)
    %timeit count(df)

Benchmarking results

Run on Python 3.6.2, pandas 0.20.3, numpy 1.13.1.

Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.

Key: g = grouper, v = value_counter, c = count. All timings are in ms.

   m        n        g        v        c
 100    10000     2.91    18.30     8.41
1000    10000     4.10    27.20     6.98 [1]
 100   100000    17.90   130.00    84.50
1000   100000    43.90   309.00    93.50

[1] This is not a typo.



1 Answer

There's actually a bit of hidden overhead in zip(df.A.values, df.B.values). The key here comes down to numpy arrays being stored in memory in a fundamentally different way than Python objects.

A numpy array, such as np.arange(10), is essentially stored as a contiguous block of memory, not as individual Python objects. Conversely, a Python list, such as list(range(10)), is stored in memory as pointers to individual Python objects (i.e. the integers 0-9). This difference is why numpy arrays are smaller in memory than their Python list equivalents, and why you can perform faster computations on numpy arrays.
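As a rough sketch of that difference (not part of the original answer; exact sizes vary by platform and build), you can compare the memory footprints directly:

import sys
import numpy as np

arr = np.arange(10)
lst = list(range(10))

# The array's data is one contiguous buffer of fixed-size integers.
print(arr.nbytes)            # 10 * arr.itemsize bytes of actual data

# getsizeof on the list only counts the list object and its pointer array;
# the integer objects it points to have to be counted separately.
print(sys.getsizeof(lst))
print(sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst))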

So, as Counter is consuming the zip, the associated tuples need to be created as Python objects. This means that Python needs to extract the tuple values from numpy data and create corresponding Python objects in memory. There is noticeable overhead to this, which is why you want to be very careful when combining pure Python functions with numpy data. A basic example of this pitfall that you might commonly see is using the built-in Python sum on a numpy array: sum(np.arange(10**5)) is actually a bit slower than the pure Python sum(range(10**5)), and both are of course significantly slower than np.sum(np.arange(10**5)).
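A minimal sketch of that sum comparison, if you want to reproduce it (exact timings will differ by machine; it is the ordering that matters):

import timeit
import numpy as np

arr = np.arange(10**5)

# Built-in sum over a numpy array: every element is boxed into a Python int.
print(timeit.timeit(lambda: sum(arr), number=100))

# Built-in sum over pure Python ints: no unboxing from numpy, so a bit faster.
print(timeit.timeit(lambda: sum(range(10**5)), number=100))

# np.sum: a vectorised C loop over the contiguous buffer, much faster again.
print(timeit.timeit(lambda: np.sum(arr), number=100))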

See this video for a more in-depth discussion of this topic.

As an example specific to this question, observe the following timings comparing the performance of Counter on zipped numpy arrays vs. the corresponding zipped Python lists.

In [2]: a = np.random.randint(10**4, size=10**6)
   ...: b = np.random.randint(10**4, size=10**6)
   ...: a_list = a.tolist()
   ...: b_list = b.tolist()

In [3]: %timeit Counter(zip(a, b))
455 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit Counter(zip(a_list, b_list))
334 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference between these two timings (roughly 455 ms - 334 ms ≈ 120 ms here) gives you a reasonable estimate of the overhead discussed earlier.

This isn't quite the end of the story though. Constructing a groupby object in pandas involves some overhead too, at least as related to this problem, since there's some groupby metadata that isn't strictly necessary just to get size, whereas Counter does the one singular thing you care about. Usually this overhead is far less than the overhead associated with Counter, but from some quick experimentation I've found that you can actually get marginally better performance from Counter when the majority of your groups consist of single elements.

Consider the following timings (using @BallpointBen's sort=False suggestion) that span the spectrum from a few large groups to many small groups:

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def count(df):
    return Counter(zip(df.A.values, df.B.values))

for m, n in [(10, 10**6), (10**3, 10**6), (10**7, 10**6)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit count(df)

Which gives me the following table:

m        grouper   counter
10       62.9 ms   315 ms
10**3    191 ms    535 ms
10**7    514 ms    459 ms

Of course, any gains from Counter would be offset by converting back to a Series, if that's what you want as your final object.
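For completeness, a minimal sketch of that conversion, assuming you want a Series indexed like the groupby result (building the MultiIndex from the tuple keys is where the extra cost goes):

# Convert the Counter back into a Series comparable to grouper(df).
c = count(df)
keys, values = zip(*c.items())
s = pd.Series(values, index=pd.MultiIndex.from_tuples(keys, names=['A', 'B']))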
