What techniques can be used to measure performance of pandas/numpy solutions

Question

How do I measure the performance of the various functions below in a concise and comprehensive way?

Example

Consider the dataframe df

    df = pd.DataFrame({
        'Group': list('QLCKPXNLNTIXAWYMWACA'),
        'Value': [29, 52, 71, 51, 45, 76, 68, 60, 92, 95,
                  99, 27, 77, 54, 39, 23, 84, 37, 99, 87]
    })

I want to sum up the Value column grouped by distinct values in Group. I have three methods for doing it.

    import pandas as pd
    import numpy as np
    from numba import njit


    def sum_pd(df):
        return df.groupby('Group').Value.sum()

    def sum_fc(df):
        f, u = pd.factorize(df.Group.values)
        v = df.Value.values
        return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()

    @njit
    def wbcnt(b, w, k):
        bins = np.arange(k)
        bins = bins * 0
        for i in range(len(b)):
            bins[b[i]] += w[i]
        return bins

    def sum_nb(df):
        b, u = pd.factorize(df.Group.values)
        w = df.Value.values
        bins = wbcnt(b, w, u.size)
        return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()

Are they the same?

    print(sum_pd(df).equals(sum_nb(df)))
    print(sum_pd(df).equals(sum_fc(df)))

    True
    True
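A 20-row frame is a narrow test. As a sketch, the same comparison can be repeated on a larger random input with pd.testing.assert_series_equal, which raises with a description of the mismatch instead of returning False. The make_df helper, its seed, and the row count are assumptions made for illustration; only sum_pd and sum_fc are reproduced so that the block runs without numba.

```python
from string import ascii_uppercase

import numpy as np
import pandas as pd


def sum_pd(df):
    return df.groupby('Group').Value.sum()


def sum_fc(df):
    f, u = pd.factorize(df.Group.values)
    v = df.Value.values
    return pd.Series(
        np.bincount(f, weights=v).astype(int),
        pd.Index(u, name='Group'),
        name='Value',
    ).sort_index()


def make_df(n, seed=0):
    # Hypothetical helper: n rows of random letters and integers below 100.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'Group': rng.choice(list(ascii_uppercase), n),
        'Value': rng.integers(100, size=n),
    })


big = make_df(10_000)

# Raises AssertionError if values, index labels, or names differ.
# check_dtype=False tolerates a platform-dependent integer width from astype(int).
pd.testing.assert_series_equal(sum_pd(big), sum_fc(big), check_dtype=False)
```

If the call returns without raising, the two implementations agree on this input.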

How fast are they?

    %timeit sum_pd(df)
    %timeit sum_fc(df)
    %timeit sum_nb(df)

    1000 loops, best of 3: 536 µs per loop
    1000 loops, best of 3: 324 µs per loop
    1000 loops, best of 3: 300 µs per loop
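%timeit is an IPython magic, so it is not available in a plain Python script. A minimal sketch of a comparable measurement with the standard-library timeit module follows. The best_time helper and its repeat and number defaults are assumptions made for illustration; taking the minimum over the repeats mirrors the "best of 3" figure reported above. sum_nb is left out so that the block runs without numba.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group': list('QLCKPXNLNTIXAWYMWACA'),
    'Value': [29, 52, 71, 51, 45, 76, 68, 60, 92, 95,
              99, 27, 77, 54, 39, 23, 84, 37, 99, 87],
})


def sum_pd(df):
    return df.groupby('Group').Value.sum()


def sum_fc(df):
    f, u = pd.factorize(df.Group.values)
    v = df.Value.values
    return pd.Series(
        np.bincount(f, weights=v).astype(int),
        pd.Index(u, name='Group'),
        name='Value',
    ).sort_index()


def best_time(func, arg, repeat=3, number=100):
    """Return the best per-call time in seconds (hypothetical helper)."""
    timer = timeit.Timer(lambda: func(arg))
    # Each entry returned by repeat() is the total time for `number` calls.
    return min(timer.repeat(repeat=repeat, number=number)) / number


for func in (sum_pd, sum_fc):
    print(f'{func.__name__}: {best_time(func, df) * 1e6:.0f} microseconds per call')
```

The printed numbers depend on the machine and library versions, so they will not match the timings above.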


asked Jun 09 '17 23:06

piRSquared

1 Answer

They might not qualify as "simple frameworks" because they are third-party modules that need to be installed, but there are two frameworks I often use:

simple_benchmark (I'm the author of that package)
perfplot

For example, the simple_benchmark library lets you decorate the functions to benchmark:

    from simple_benchmark import BenchmarkBuilder
    b = BenchmarkBuilder()

    import pandas as pd
    import numpy as np
    from numba import njit

    @b.add_function()
    def sum_pd(df):
        return df.groupby('Group').Value.sum()

    @b.add_function()
    def sum_fc(df):
        f, u = pd.factorize(df.Group.values)
        v = df.Value.values
        return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()

    @njit
    def wbcnt(b, w, k):
        bins = np.arange(k)
        bins = bins * 0
        for i in range(len(b)):
            bins[b[i]] += w[i]
        return bins

    @b.add_function()
    def sum_nb(df):
        b, u = pd.factorize(df.Group.values)
        w = df.Value.values
        bins = wbcnt(b, w, u.size)
        return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()

Also decorate a function that produces the values for the benchmark:

    from string import ascii_uppercase

    def creator(n):  # taken from another answer here
        letters = list(ascii_uppercase)
        np.random.seed([3,1415])
        df = pd.DataFrame(dict(
                Group=np.random.choice(letters, n),
                Value=np.random.randint(100, size=n)
            ))
        return df

    @b.add_arguments('Rows in DataFrame')
    def argument_provider():
        for exponent in range(4, 22):
            size = 2**exponent
            yield size, creator(size)

And then all you need to run the benchmark is:

r = b.run()

After that you can inspect the results as a plot (you need the matplotlib library for this):

r.plot()


If the functions have very similar run times, the percentage difference can be more informative than the absolute numbers:

r.plot_difference_percentage(relative_to=sum_nb)


Or get the times for the benchmark as a DataFrame (this needs pandas):

r.to_pandas_dataframe()

               sum_pd    sum_fc    sum_nb
    16       0.000796  0.000515  0.000502
    32       0.000702  0.000453  0.000454
    64       0.000702  0.000454  0.000456
    128      0.000711  0.000456  0.000458
    256      0.000714  0.000461  0.000462
    512      0.000728  0.000471  0.000473
    1024     0.000746  0.000512  0.000513
    2048     0.000825  0.000515  0.000514
    4096     0.000902  0.000609  0.000640
    8192     0.001056  0.000731  0.000755
    16384    0.001381  0.001012  0.000936
    32768    0.001885  0.001465  0.001328
    65536    0.003404  0.002957  0.002585
    131072   0.008076  0.005668  0.005159
    262144   0.015532  0.011059  0.010988
    524288   0.032517  0.023336  0.018608
    1048576  0.055144  0.040367  0.035487
    2097152  0.112333  0.080407  0.072154

If you don't like the decorators, you can also set everything up in one call (in that case you don't need the BenchmarkBuilder and the add_function/add_arguments decorators):

    from simple_benchmark import benchmark
    r = benchmark([sum_pd, sum_fc, sum_nb], {2**i: creator(2**i) for i in range(4, 22)}, "Rows in DataFrame")

Here perfplot offers a very similar interface (and result):

    import perfplot
    r = perfplot.bench(
        setup=creator,
        kernels=[sum_pd, sum_fc, sum_nb],
        n_range=[2**k for k in range(4, 22)],
        xlabel='Rows in DataFrame',
        )
    import matplotlib.pyplot as plt
    plt.loglog()
    r.plot()
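If installing a benchmarking package is not an option, a table like the one returned by to_pandas_dataframe() can be approximated by hand. The sketch below is one possible approach using the standard-library timeit module, not a description of how simple_benchmark or perfplot work internally. The bench_table helper, its parameters, and the shorter size range are assumptions made for illustration, and sum_nb is left out so that the block runs without numba.

```python
import timeit
from string import ascii_uppercase

import numpy as np
import pandas as pd


def sum_pd(df):
    return df.groupby('Group').Value.sum()


def sum_fc(df):
    f, u = pd.factorize(df.Group.values)
    v = df.Value.values
    return pd.Series(
        np.bincount(f, weights=v).astype(int),
        pd.Index(u, name='Group'),
        name='Value',
    ).sort_index()


def creator(n):
    letters = list(ascii_uppercase)
    np.random.seed([3, 1415])
    return pd.DataFrame(dict(
        Group=np.random.choice(letters, n),
        Value=np.random.randint(100, size=n),
    ))


def bench_table(funcs, sizes, repeat=3, number=10):
    """Hypothetical helper: best seconds per call, one row per input size."""
    rows = {}
    for n in sizes:
        data = creator(n)  # build the input outside the timed region
        row = {}
        for func in funcs:
            totals = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
            row[func.__name__] = min(totals) / number
        rows[n] = row
    return pd.DataFrame.from_dict(rows, orient='index')


table = bench_table([sum_pd, sum_fc], [2**k for k in range(4, 12)])
print(table)
```

With matplotlib installed, table.plot(loglog=True) gives a rough counterpart to the plots above.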


answered Feb 11 '23 07:02

MSeifert
