The following code summarizes numeric data using two different approaches.
The first approach uses DataFrame.describe() and passes some specific extra percentiles.
The second approach separately computes the summary stats (mean, std, N), stacks them, computes the same quantiles, then appends the two and sorts by the index, so the result is essentially the same as the first approach.
There are some minor naming differences that we can clean up afterwards (a sketch of that cleanup is shown after the timings below), and since the summarized data is small, that is very fast.
It turns out that using the describe function was about 8x slower in this example.
I am looking for reasons why, and perhaps for suggestions on other approaches that may speed this up even more. The filters, groups, and values are all passed in from the UI to a Tornado service, so speed is important: the user is waiting for results, and the data can be even larger than in this example.
import pandas as pd
import numpy as np
from datetime import datetime
def make_data(n):
    ts = datetime.now().timestamp() + abs(np.random.normal(60, 30, n)).cumsum()
    df = pd.DataFrame({
        'c1': np.random.choice(list('ABCDEFGH'), n),
        'c2': np.random.choice(list('ABCDEFGH'), n),
        'c3': np.random.choice(list('ABCDEFGH'), n),
        't1': np.random.randint(1, 20, n),
        't2': pd.to_datetime(ts, unit='s'),
        'x1': np.random.randn(n),
        'x2': np.random.randn(n),
        'x3': np.random.randn(n)
    })
    return df
def summarize_numeric_1(df, mask, groups, values, quantiles):
    dfg = df[mask].groupby(groups)[values]
    return dfg.describe(percentiles=quantiles).stack()
def summarize_numeric_2(df, filt, groups, values, quantiles):
    dfg = df[filt].groupby(groups)[values]
    dfg_stats = dfg.agg([np.mean, np.std, len]).stack()
    dfg_quantiles = dfg.quantile(quantiles)
    return dfg_stats.append(dfg_quantiles).sort_index()
%time df = make_data(1000000)
groups = ['c1', 'c2', 't1']
mask = df['c3'].eq('H') & df['c1'].eq('A')
values = ['x1', 'x3']
base_quantiles = [0, .5, 1]
extd_quantiles = [0.25, 0.75, 0.9]
all_quantiles = base_quantiles + extd_quantiles
%timeit summarize_numeric_1(df, mask, groups, values, extd_quantiles)
%timeit summarize_numeric_2(df, mask, groups, values, all_quantiles)
The timings on my PC for this are:
Using describe: 873 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using the two-step method: 105 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
All inputs welcome!
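For reference, here is a rough sketch (the helper name and mapping are just illustrative, not part of the original code) of the naming cleanup mentioned above, assuming the two-step output of summarize_numeric_2, whose innermost index level mixes the strings 'mean'/'std'/'len' with the raw quantile floats:

def harmonize_names(summary):
    # Map the two-step labels onto describe()-style names; the quantile keys
    # below assume the all_quantiles list defined above.
    rename_map = {'len': 'count', 0.0: 'min', 0.25: '25%', 0.5: '50%',
                  0.75: '75%', 0.9: '90%', 1.0: 'max'}
    return summary.rename(index=rename_map, level=-1).sort_index()

summary2 = harmonize_names(summarize_numeric_2(df, mask, groups, values, all_quantiles))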
I will post this as an answer, maybe to be deleted later, since it is more of an educated guess than an actual answer. Also it is a bit too long for a comment.
So the first thing I did after reading your question was to re-run your timings in a profiler to take a closer look at the issue. Since the time of the computation itself was rather short, it got overshadowed quite a bit by the data generation. However, in general the times were similar to what you described. Not only that, the difference got even more pronounced:
1094 ms for the first approach vs 63 ms for the second one. That makes for a factor-of-17 difference.
Since the lower of the two times was rather small, I decided it was too small to be trusted and re-ran the test with a 10x larger generated data sample. That upped the data generation step to one minute, and the numbers got weird: 1173 ms for the first approach vs 506 ms for the second one. The factor was only slightly worse than two.
I started to suspect something. To confirm my suspicion, I ran one last test, once again increasing the data size by a factor of 10. The result might surprise you:
12258 ms for the first approach vs 3646 ms for the second one. The tables have turned completely, the factor now being ~0.3.
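For reference, a minimal sketch of how such a scaling comparison could be reproduced; the sample sizes, repeat count, and printout are illustrative, not the exact commands used for the numbers above:

import timeit

for n in (1_000_000, 10_000_000, 100_000_000):  # the largest size needs a lot of RAM
    df = make_data(n)
    mask = df['c3'].eq('H') & df['c1'].eq('A')
    t1 = timeit.timeit(lambda: summarize_numeric_1(df, mask, groups, values, extd_quantiles), number=3) / 3
    t2 = timeit.timeit(lambda: summarize_numeric_2(df, mask, groups, values, all_quantiles), number=3) / 3
    print(f'n={n:>11,}  describe: {t1:.3f}s  two-step: {t2:.3f}s  ratio: {t1 / t2:.1f}')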
My guess in this case is that the pandas computation is actually the one with the better optimization/algorithm. However, since it is pandas, it has quite a bit of additional baggage around it - the price paid for convenience and robustness. This means that there is a layer of "unnecessary" (computation-wise) baggage that needs to be carried around no matter how large the data set is.
So if you want to be faster than pandas even on data sets of your size, take their operations and write them yourself - in the most straightforward manner possible. That will keep their optimization and discard the baggage paid for convenience.
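To make that idea concrete, here is a minimal sketch (mine, not benchmarked) of a "most straightforward" version that computes the same statistics per group directly with numpy; whether it actually beats the groupby versions would need to be measured on your data:

def summarize_numeric_plain(df, filt, groups, values, quantiles):
    # Straightforward per-group computation with plain numpy calls.
    # Assumes multiple group columns (tuple keys) and no NaNs in the value columns.
    rows = {}
    for key, sub in df[filt].groupby(groups):
        for col in values:
            arr = sub[col].to_numpy()
            stats = {'mean': arr.mean(), 'std': arr.std(ddof=1), 'len': arr.size}
            stats.update(dict(zip(quantiles, np.quantile(arr, quantiles))))
            rows[key + (col,)] = stats
    out = pd.DataFrame.from_dict(rows, orient='index')
    out.index = pd.MultiIndex.from_tuples(out.index, names=groups + ['value'])
    return out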
Note: this answer is for pandas version 1.0.5. Things might be different for other versions.
The pandas describe() method will always be slower than your version, because under the hood it's using almost the exact same logic, plus some other things like making sure the data has the right dimensions, ordering the results, and checking for NaNs and the correct data types.
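As a small illustration of the NaN handling mentioned above (my example, not taken from the describe source):

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(s.describe())
# count    3.000000   <- the NaN is excluded from the count
# mean     2.333333
# std      1.527525
# min      1.000000
# 25%      1.500000
# 50%      2.000000
# 75%      3.000000
# max      4.000000
# dtype: float64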
Taking a look at the source code of the describe method, we can see a few things:
- it computes each statistic with essentially the same calls as your version (see the describe() method for an example of how it's using the same logic), which means that pandas describe will always be slower;
- it counts only non-NaN values with s.count(), but your code counts all values. Let's try and modify your code to use that same method instead of len():
def summarize_numeric_3(df, filt, groups, values, quantiles):
    dfg = df[filt].groupby(groups)[values]
    dfg_stats = dfg.agg([np.mean, np.std, pd.Series.count]).stack()
    dfg_quantiles = dfg.quantile(quantiles)
    return dfg_stats.append(dfg_quantiles).sort_index()
%timeit -n 10 summarize_numeric_3(df, mask, groups, values, all_quantiles)
# outputs
# 48.9 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It takes ~49 ms, vs 42 ms for your version on my machine. It took 7 more ms just for this relatively small modification!
I refactored the describe method code into a "self-contained" version that you can profile and tinker with here (too long to be posted in this answer). Profiling that, I see that a very large portion of the time is taken to "set a convenient order for rows". Removing that ordering improved the describe timing by ~8%, from 530 ms down to 489 ms.
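If you want to dig into where the time goes on your own data, here is a minimal sketch using the standard-library profiler (in IPython, %prun summarize_numeric_1(...) gives the same information interactively; the output file names are just examples):

import cProfile
import pstats

cProfile.run('summarize_numeric_1(df, mask, groups, values, extd_quantiles)', 'describe.prof')
cProfile.run('summarize_numeric_2(df, mask, groups, values, all_quantiles)', 'two_step.prof')
# Show the 20 most expensive calls by cumulative time for each run
pstats.Stats('describe.prof').sort_stats('cumulative').print_stats(20)
pstats.Stats('two_step.prof').sort_stats('cumulative').print_stats(20)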