Let me provide a quick demo which shows that the second approach is roughly 10x slower than the first one.
import pandas as pd
from timeit import default_timer as timer

r = range(1, int(1e7))
df = pd.DataFrame({
    'col0': [i % 3 for i in r],
    'col1': r,
})
df['pad'] = '*' * 100  # extra column that is not needed for the aggregation

start = timer()
print(df.groupby('col0')['col1'].min())   # select the column, then aggregate
end = timer()
print(end - start)

start = timer()
print(df.groupby('col0').min()['col1'])   # aggregate everything, then select
end = timer()
print(end - start)
Output:
col0
0 3
1 1
2 2
Name: col1, dtype: int64
0.14302301406860352
col0
0 3
1 1
2 2
Name: col1, dtype: int64
1.4934422969818115
The reason is obvious: in the second case pandas also computes the min for the pad column, while in the first case it does not.
Is there any way to make pandas aware that the computation on the DataFrameGroupBy object is only required for col1 in the second case?
If this is impossible, I'm curious whether this is a limitation of the current pandas implementation or of the Python language itself (i.e. the expression df.groupby('col0').min() must be fully computed no matter what follows next).
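A quick way to verify the claim about pad, reusing df and timer from the demo above: drop pad and re-time the second formulation; the gap should mostly disappear.

slim = df.drop(columns='pad')  # remove the column that was being aggregated needlessly

start = timer()
print(slim.groupby('col0').min()['col1'])
end = timer()
print(end - start)  # expected to be close to the first approach's timing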
Thanks
pandas DataFrames use an eager execution model by design:
https://pandas.pydata.org/pandas-docs/version/0.18.1/release.html#id96
Eager evaluation of groups when calling groupby functions, so if there is an exception with the grouping function it will be raised immediately versus sometime later on when the groups are needed
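As a minimal illustration of that eager behaviour (my own sketch, not taken from the release note): grouping by a non-existent column fails at the groupby() call itself, before any aggregation is requested.

import pandas as pd

df = pd.DataFrame({'col0': [0, 1, 2], 'col1': [10, 20, 30]})

try:
    df.groupby('no_such_column')  # fails here, before .min() or any other aggregation
except KeyError as exc:
    print('groupby raised eagerly:', exc)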
The alternative is pandas on Spark - https://spark.apache.org/pandas-on-spark/
pandas uses eager evaluation: it loads all the data into memory and executes operations immediately when they are invoked, without applying any query optimization. pandas-on-Spark, in contrast, builds a lazy query plan that Spark can optimize before execution.
It is possible to convert between the two with to_spark/to_pandas.
Similarly, it is possible to convert between pandas and traditional Spark DataFrames with createDataFrame/toPandas.
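A minimal sketch of the pandas-on-Spark route and the conversions mentioned above (assuming a local pyspark >= 3.2 installation; the data and column names are illustrative):

import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({'col0': [0, 1, 2, 0, 1, 2], 'col1': [3, 1, 2, 6, 4, 5]})

# pandas -> pandas-on-Spark: the aggregation below is planned lazily and can be
# optimized by Spark before execution, unlike eager pandas.
psdf = ps.from_pandas(pdf)
print(psdf.groupby('col0')['col1'].min().to_pandas())  # to_pandas() materializes the result

# pandas-on-Spark <-> traditional Spark DataFrame
sdf = psdf.to_spark()
print(sdf.toPandas())              # Spark DataFrame -> pandas via toPandas()
sdf2 = spark.createDataFrame(pdf)  # pandas -> Spark via createDataFrame()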