Consider the minimal example
customer day purchase
Joe 1 5
Joe 1 10
Joe 2 5
Joe 2 5
Joe 4 10
Joe 7 5
In BigQuery, one would do something similar to this to get how much the customer spent in the last 2 days for every day:
SELECT customer, day
, sum(purchase) OVER (PARTITION BY customer ORDER BY day ASC RANGE between 2 preceding and 1 preceding)
FROM table
What would be the equivalent in pandas? i.e., expected outcome
customer day purchase amount_last_2d
Joe 1 5 null -- spent days [-,-]
Joe 1 10 null -- spent days [-,-]
Joe 2 5 15 -- spent days [-,1]
Joe 2 5 15 -- spent days [-,1]
Joe 4 10 10 -- spent days [2,3]
Joe 7 5 0 -- spent days [5,6]
len() method is used to get the number of rows and number of columns individually.
Advertisements. For working on numerical data, Pandas provide few variants like rolling, expanding and exponentially moving weights for window statistics. Among these are sum, mean, median, variance, covariance, correlation, etc.
This post assumes you are familiar with Windowing and Partitioning functions. Let’s translate the most common SQL Analytical function equivalents to Pandas. The Pandas equivalent of row number within each partition with multiple sort by parameters: The Pandas equivalent of average, running average (moving average) window functions:
Let’s translate the most common SQL Analytical function equivalents to Pandas. The Pandas equivalent of row number within each partition with multiple sort by parameters: The Pandas equivalent of average, running average (moving average) window functions: .transform is a powerful command in Pandas, which I invite you to learn more about here —
This is as close to a SQL like window functionality as it gets in Pandas. Can also just pass in the pandas Rank function instead wrapping it in lambda. df.groupby (by= ['C1']) ['C2'].transform (pd.DataFrame.rank) To get the behaviour of row_number (), you should pass method='first' to the rank function.
These rows are used in the window function. On the other hand, the RANGE clause logically limits the rows. That means it considers the rows based on their value compared to the current row. Let’s see a practical example of how these two clauses can return different results. I’m sure you’re familiar with window functions and their syntax.
Try groupby
with shift
then reindex
back
df['new'] = df.groupby(['customer','day']).purchase.sum().shift().reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[259]:
customer day purchase new
0 Joe 1 5 NaN
1 Joe 1 10 NaN
2 Joe 2 10 15.0
3 Joe 2 5 15.0
4 Joe 4 10 15.0
Update
s = df.groupby(['customer','day']).apply(lambda x : df.loc[df.customer.isin(x['customer'].tolist()) & (df.day.isin(x['day']-1)|df.day.isin(x['day']-2)),'purchase'].sum())
df['new'] = s.reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[271]:
customer day purchase new
0 Joe 1 5 0
1 Joe 1 10 0
2 Joe 2 5 15
3 Joe 2 5 15
4 Joe 4 10 10
5 Joe 7 5 0
Not sure if this is the right way to go, and this is limited since only one customer is provided; if there were different customers, I would use merge
instead of map
; Note also that there is also an implicit assumption that the days are ordered in ascending already:
Get the purchase sum based on the groupby combination of customer
and day
and create a mapping between day
and the sum:
sum_purchase = (df.groupby(["customer", "day"])
.purchase
.sum()
.shift()
.droplevel(0))
Again, for multiple customers, I would not drop the customer
index, and instead use a merge below:
Get a mapping of the days with the difference between the days:
diff_2_days = (df.drop_duplicates("day")[["day"]]
.set_index("day", drop=False)
.diff()
.day)
Create the new column by mapping the above values to the day column, then use np.where
to get columns where the diff is less than or equal to 2:
(
df.assign(
diff_2_days = df.day.map(diff_2_days),
sum_purchase = df.day.map(sum_purchase),
final=lambda df: np.where(df.diff_2_days.le(2),
df.sum_purchase,
np.nan))
.drop(columns=["sum_purchase", "diff_2_days"])
)
customer day purchase final
0 Joe 1 5 NaN
1 Joe 1 10 NaN
2 Joe 2 5 15.0
3 Joe 2 5 15.0
4 Joe 4 10 10.0
5 Joe 7 5 NaN
Ran your code in postgres to get an idea of what range does and how it differs from rows; quite insightful. I think for windows functions, SQL got this covered and easily too.
SO, let me know where this falls on its face, and I'll gladly have a rejig at it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With