Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What techniques can be used to measure performance of pandas/numpy solutions

Tags:

Question

How do I measure the performance of the various functions below in a concise and comprehensive way.

Example

Consider the dataframe df

df = pd.DataFrame({         'Group': list('QLCKPXNLNTIXAWYMWACA'),         'Value': [29, 52, 71, 51, 45, 76, 68, 60, 92, 95,                   99, 27, 77, 54, 39, 23, 84, 37, 99, 87]     }) 

I want to sum up the Value column grouped by distinct values in Group. I have three methods for doing it.

import pandas as pd import numpy as np from numba import njit   def sum_pd(df):     return df.groupby('Group').Value.sum()  def sum_fc(df):     f, u = pd.factorize(df.Group.values)     v = df.Value.values     return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()  @njit def wbcnt(b, w, k):     bins = np.arange(k)     bins = bins * 0     for i in range(len(b)):         bins[b[i]] += w[i]     return bins  def sum_nb(df):     b, u = pd.factorize(df.Group.values)     w = df.Value.values     bins = wbcnt(b, w, u.size)     return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index() 

Are they the same?

print(sum_pd(df).equals(sum_nb(df))) print(sum_pd(df).equals(sum_fc(df)))  True True 

How fast are they?

%timeit sum_pd(df) %timeit sum_fc(df) %timeit sum_nb(df)  1000 loops, best of 3: 536 µs per loop 1000 loops, best of 3: 324 µs per loop 1000 loops, best of 3: 300 µs per loop 
like image 329
piRSquared Avatar asked Jun 09 '17 23:06

piRSquared


People also ask

Which is faster pandas or NumPy?

Numpy is memory efficient. Pandas has a better performance when a number of rows is 500K or more. Numpy has a better performance when number of rows is 50K or less. Indexing of the pandas series is very slow as compared to numpy arrays.

What is the use of NumPy in pandas?

NumPy is a library for Python that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. Pandas is a high-level data manipulation tool that is built on the NumPy package.

Can you use NumPy on pandas DataFrame?

Pandas expands on NumPy by providing easy to use methods for data analysis to operate on the DataFrame and Series classes, which are built on NumPy's powerful ndarray class.

Why we use NumPy and pandas in machine learning?

NumPy library provides objects for multi-dimensional arrays, whereas Pandas is capable of offering an in-memory 2d table object called DataFrame. NumPy consumes less memory as compared to Pandas. Indexing of the Series objects is quite slow as compared to NumPy arrays.


1 Answers

They might not classify as "simple frameworks" because they are third-party modules that need to be installed but there are two frameworks I often use:

  • simple_benchmark (I'm the author of that package)
  • perfplot

For example the simple_benchmark library allows to decorate the functions to benchmark:

from simple_benchmark import BenchmarkBuilder b = BenchmarkBuilder()  import pandas as pd import numpy as np from numba import njit  @b.add_function() def sum_pd(df):     return df.groupby('Group').Value.sum()  @b.add_function() def sum_fc(df):     f, u = pd.factorize(df.Group.values)     v = df.Value.values     return pd.Series(np.bincount(f, weights=v).astype(int), pd.Index(u, name='Group'), name='Value').sort_index()  @njit def wbcnt(b, w, k):     bins = np.arange(k)     bins = bins * 0     for i in range(len(b)):         bins[b[i]] += w[i]     return bins  @b.add_function() def sum_nb(df):     b, u = pd.factorize(df.Group.values)     w = df.Value.values     bins = wbcnt(b, w, u.size)     return pd.Series(bins, pd.Index(u, name='Group'), name='Value').sort_index() 

Also decorate a function that produces the values for the benchmark:

from string import ascii_uppercase  def creator(n):  # taken from another answer here     letters = list(ascii_uppercase)     np.random.seed([3,1415])     df = pd.DataFrame(dict(             Group=np.random.choice(letters, n),             Value=np.random.randint(100, size=n)         ))     return df  @b.add_arguments('Rows in DataFrame') def argument_provider():     for exponent in range(4, 22):         size = 2**exponent         yield size, creator(size) 

And then all you need to run the benchmark is:

r = b.run() 

After that you can inspect the results as plot (you need the matplotlib library for this):

r.plot() 

enter image description here

In case the functions are very similar in run-time the percentage difference instead of absolute numbers could be more important:

r.plot_difference_percentage(relative_to=sum_nb)  

enter image description here

Or get the times for the benchmark as DataFrame (this needs pandas)

r.to_pandas_dataframe() 
           sum_pd    sum_fc    sum_nb 16       0.000796  0.000515  0.000502 32       0.000702  0.000453  0.000454 64       0.000702  0.000454  0.000456 128      0.000711  0.000456  0.000458 256      0.000714  0.000461  0.000462 512      0.000728  0.000471  0.000473 1024     0.000746  0.000512  0.000513 2048     0.000825  0.000515  0.000514 4096     0.000902  0.000609  0.000640 8192     0.001056  0.000731  0.000755 16384    0.001381  0.001012  0.000936 32768    0.001885  0.001465  0.001328 65536    0.003404  0.002957  0.002585 131072   0.008076  0.005668  0.005159 262144   0.015532  0.011059  0.010988 524288   0.032517  0.023336  0.018608 1048576  0.055144  0.040367  0.035487 2097152  0.112333  0.080407  0.072154 

In case you don't like the decorators you could also setup everything in one call (in that case you don't need the BenchmarkBuilder and the add_function/add_arguments decorators):

from simple_benchmark import benchmark r = benchmark([sum_pd, sum_fc, sum_nb], {2**i: creator(2**i) for i in range(4, 22)}, "Rows in DataFrame") 

Here perfplot offers a very similar interface (and result):

import perfplot r = perfplot.bench(     setup=creator,     kernels=[sum_pd, sum_fc, sum_nb],     n_range=[2**k for k in range(4, 22)],     xlabel='Rows in DataFrame',     ) import matplotlib.pyplot as plt plt.loglog() r.plot() 

enter image description here

like image 70
MSeifert Avatar answered Feb 11 '23 07:02

MSeifert