Let's say I have the following dataset:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv")
df["tip_fcst"] = np.random.uniform(low=0, high=0.40, size=len(df))
df["tip_fcst"] = df.tip_fcst * df.total_bill
df.head(5)
total_bill tip sex smoker day time size tip_fcst
0 16.99 1.01 Female No Sun Dinner 2 1.123689
1 10.34 1.66 Male No Sun Dinner 3 3.125474
2 21.01 3.50 Male No Sun Dinner 3 2.439321
3 23.68 3.31 Male No Sun Dinner 2 3.099715
4 24.59 3.61 Female No Sun Dinner 4 1.785596
And I am performing the following operations
time_table = (
df
.groupby("time")
.agg({"tip": lambda x:
df.ix[x.index].tip.sum() / df.ix[x.index].total_bill.sum(),
"tip_fcst": lambda x:
df.ix[x.index].tip_fcst.sum() / df.ix[x.index].total_bill.sum()
})
)
What I would like to do is add another step using assign
to create a new variable called difference
. The problem I'm having is that I don't know how to reference the "current version" of the dataframe to use the newly created variables. I realize that I could just save what I have so far to time_table
and then use time_table["difference"] = time_table.tip_fcst - time_table.tip
, but I like this flow of chained operations and was hoping there was a way to do it within there. Is this possible?
You could chain these altogether if you assign
the selected DF
with a lambda
function:
(df.groupby("time").agg({"tip": lambda x: df.ix[x.index].tip.sum() / df.ix[x.index].total_bill.sum(),
"tip_fcst": lambda x: df.ix[x.index].tip_fcst.sum() / df.ix[x.index].total_bill.sum()})
).assign(difference=lambda x: x.tip_fcst - x.tip)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With