How do I measure the performance of the various functions below in a concise and comprehensive way.
Consider the dataframe df
df = pd.DataFrame({ 'Group': list('QLCKPXNLNTIXAWYMWACA'), 'Value': [29, 52, 71, 51, 45, 76, 68, 60, 92, 95, 99, 27, 77, 54, 39, 23, 84, 37, 99, 87] })
I want to sum up the Value
column grouped by distinct values in Group
. I have three methods for doing it.
import pandas as pd import numpy as np from numba import njit def sum_pd(df): return df.groupby('Group').Value.sum() def sum_fc(df): f, u = pd.factorize(df.Group.values) v = df.Value.values return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index() @njit def wbcnt(b, w, k): bins = np.arange(k) bins = bins * 0 for i in range(len(b)): bins[b[i]] += w[i] return bins def sum_nb(df): b, u = pd.factorize(df.Group.values) w = df.Value.values bins = wbcnt(b, w, u.size) return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()
print(sum_pd(df).equals(sum_nb(df))) print(sum_pd(df).equals(sum_fc(df))) True True
%timeit sum_pd(df) %timeit sum_fc(df) %timeit sum_nb(df) 1000 loops, best of 3: 536 µs per loop 1000 loops, best of 3: 324 µs per loop 1000 loops, best of 3: 300 µs per loop
Numpy is memory efficient. Pandas has a better performance when a number of rows is 500K or more. Numpy has a better performance when number of rows is 50K or less. Indexing of the pandas series is very slow as compared to numpy arrays.
NumPy is a library for Python that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. Pandas is a high-level data manipulation tool that is built on the NumPy package.
Pandas expands on NumPy by providing easy to use methods for data analysis to operate on the DataFrame and Series classes, which are built on NumPy's powerful ndarray class.
NumPy library provides objects for multi-dimensional arrays, whereas Pandas is capable of offering an in-memory 2d table object called DataFrame. NumPy consumes less memory as compared to Pandas. Indexing of the Series objects is quite slow as compared to NumPy arrays.
They might not classify as "simple frameworks" because they are third-party modules that need to be installed but there are two frameworks I often use:
simple_benchmark
(I'm the author of that package)perfplot
For example the simple_benchmark
library allows to decorate the functions to benchmark:
from simple_benchmark import BenchmarkBuilder b = BenchmarkBuilder() import pandas as pd import numpy as np from numba import njit @b.add_function() def sum_pd(df): return df.groupby('Group').Value.sum() @b.add_function() def sum_fc(df): f, u = pd.factorize(df.Group.values) v = df.Value.values return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index() @njit def wbcnt(b, w, k): bins = np.arange(k) bins = bins * 0 for i in range(len(b)): bins[b[i]] += w[i] return bins @b.add_function() def sum_nb(df): b, u = pd.factorize(df.Group.values) w = df.Value.values bins = wbcnt(b, w, u.size) return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index()
Also decorate a function that produces the values for the benchmark:
from string import ascii_uppercase def creator(n): # taken from another answer here letters = list(ascii_uppercase) np.random.seed([3,1415]) df = pd.DataFrame(dict( Group=np.random.choice(letters, n), Value=np.random.randint(100, size=n) )) return df @b.add_arguments('Rows in DataFrame') def argument_provider(): for exponent in range(4, 22): size = 2**exponent yield size, creator(size)
And then all you need to run the benchmark is:
r = b.run()
After that you can inspect the results as plot (you need the matplotlib
library for this):
r.plot()
In case the functions are very similar in run-time the percentage difference instead of absolute numbers could be more important:
r.plot_difference_percentage(relative_to=sum_nb)
Or get the times for the benchmark as DataFrame
(this needs pandas
)
r.to_pandas_dataframe()
sum_pd sum_fc sum_nb 16 0.000796 0.000515 0.000502 32 0.000702 0.000453 0.000454 64 0.000702 0.000454 0.000456 128 0.000711 0.000456 0.000458 256 0.000714 0.000461 0.000462 512 0.000728 0.000471 0.000473 1024 0.000746 0.000512 0.000513 2048 0.000825 0.000515 0.000514 4096 0.000902 0.000609 0.000640 8192 0.001056 0.000731 0.000755 16384 0.001381 0.001012 0.000936 32768 0.001885 0.001465 0.001328 65536 0.003404 0.002957 0.002585 131072 0.008076 0.005668 0.005159 262144 0.015532 0.011059 0.010988 524288 0.032517 0.023336 0.018608 1048576 0.055144 0.040367 0.035487 2097152 0.112333 0.080407 0.072154
In case you don't like the decorators you could also setup everything in one call (in that case you don't need the BenchmarkBuilder
and the add_function
/add_arguments
decorators):
from simple_benchmark import benchmark r = benchmark([sum_pd, sum_fc, sum_nb], {2**i: creator(2**i) for i in range(4, 22)}, "Rows in DataFrame")
Here perfplot
offers a very similar interface (and result):
import perfplot r = perfplot.bench( setup=creator, kernels=[sum_pd, sum_fc, sum_nb], n_range=[2**k for k in range(4, 22)], xlabel='Rows in DataFrame', ) import matplotlib.pyplot as plt plt.loglog() r.plot()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With