Pandas Equivalent for SQL window function and rows range

Consider the minimal example:

customer   day  purchase
Joe        1       5
Joe        1      10
Joe        2       5
Joe        2       5       
Joe        4      10
Joe        7       5
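
To reproduce this, the table above can be built as a DataFrame roughly as follows. The variable name df is an assumption here; it is the name the answers below use.

```python
import pandas as pd

# Sample data from the table above; `df` is the name the answers assume.
df = pd.DataFrame({
    "customer": ["Joe"] * 6,
    "day": [1, 1, 2, 2, 4, 7],
    "purchase": [5, 10, 5, 5, 10, 5],
})
```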

In BigQuery, one would do something like this to get, for every day, how much the customer spent over the previous 2 days:

SELECT customer, day,
       SUM(purchase) OVER (PARTITION BY customer ORDER BY day ASC
                           RANGE BETWEEN 2 PRECEDING AND 1 PRECEDING) AS amount_last_2d
FROM table

What would be the equivalent in pandas? The expected outcome is:

customer   day  purchase    amount_last_2d
Joe        1       5             null  -- spent days [-,-]
Joe        1      10             null  -- spent days [-,-]
Joe        2       5               15  -- spent days [-,1]
Joe        2       5               15  -- spent days [-,1]
Joe        4      10               10  -- spent days [2,3]
Joe        7       5                0  -- spent days [5,6]

asked Jan 29 '21 18:01

2 Answers

Try groupby with shift, then reindex back to the original rows:

df['new'] = df.groupby(['customer','day']).purchase.sum().shift().reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[259]: 
  customer  day  purchase   new
0      Joe    1         5   NaN
1      Joe    1        10   NaN
2      Joe    2         5  15.0
3      Joe    2         5  15.0
4      Joe    4        10  10.0
5      Joe    7         5  10.0

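Update: shift only looks at the previous day present in the data, not at a fixed number of days back. To sum over the previous two calendar days instead: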

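# For each (customer, day) group, sum that customer's purchases made
# 1 or 2 days earlier. g.name is the group's (customer, day) key.
s = (df.groupby(['customer', 'day'])
       .apply(lambda g: df.loc[(df.customer == g.name[0])
                               & df.day.between(g.name[1] - 2, g.name[1] - 1),
                               'purchase'].sum()))
# Broadcast the per-group result back onto the original rows.
df['new'] = s.reindex(pd.MultiIndex.from_frame(df[['customer', 'day']])).values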
df
Out[271]: 
  customer  day  purchase  new
0      Joe    1         5    0
1      Joe    1        10    0
2      Joe    2         5   15
3      Joe    2         5   15
4      Joe    4        10   10
5      Joe    7         5    0
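
If the apply above gets slow on larger data, one vectorized alternative is a shifted self-merge. This is only a sketch under the assumption that day is an integer column; the names daily, shifted, window_sum and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Joe"] * 6,
    "day": [1, 1, 2, 2, 4, 7],
    "purchase": [5, 10, 5, 5, 10, 5],
})

# One row per (customer, day) holding that day's total.
daily = df.groupby(["customer", "day"], as_index=False)["purchase"].sum()

# A total recorded on day d belongs to the window [t-2, t-1] of the target
# days t = d+1 and t = d+2, so push each total forward by 1 and by 2 days.
shifted = pd.concat(
    [daily.assign(day=daily["day"] + k) for k in (1, 2)],
    ignore_index=True,
)

# Add up the contributions per target day. Target days that receive no
# contribution are absent here and become NaN after the left merge.
window_sum = (shifted.groupby(["customer", "day"], as_index=False)["purchase"]
                     .sum()
                     .rename(columns={"purchase": "amount_last_2d"}))

result = df.merge(window_sum, on=["customer", "day"], how="left")
```

With the sample data this gives NaN for both day-1 rows and for day 7, where the apply version gives 0. As far as I know SUM over an empty frame is NULL in SQL, so NaN may be the closer match; add .fillna(0) if zeros are preferred.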

answered Oct 19 '22 12:10

I'm not sure this is the right way to go, and it is limited because only one customer is provided; with several customers I would use merge instead of map. There is also an implicit assumption that the days are already sorted in ascending order.

Sum the purchases for each (customer, day) combination, shift by one, and build a mapping from each day to the previous day's sum:

sum_purchase = (df.groupby(["customer", "day"])
                 .purchase
                 .sum()
                 .shift()
                 .droplevel(0))

Again, for multiple customers, I would not drop the customer index and would use a merge instead.

Get a mapping from each day to its difference from the previous day:

diff_2_days = (df.drop_duplicates("day")[["day"]]
                 .set_index("day", drop=False)
                 .diff()
                 .day)

Create the new column by mapping the above values onto the day column, then use np.where to keep the sum only in rows where the diff is less than or equal to 2:

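import numpy as np

(
    df.assign(
        diff_2_days=df.day.map(diff_2_days),    # gap to the previous day with data
        sum_purchase=df.day.map(sum_purchase),  # total of the previous day with data
        # keep the previous total only when that day is at most 2 days back
        final=lambda d: np.where(d.diff_2_days.le(2),
                                 d.sum_purchase,
                                 np.nan))
      .drop(columns=["sum_purchase", "diff_2_days"])
)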

  customer  day  purchase  final
0      Joe    1         5    NaN
1      Joe    1        10    NaN
2      Joe    2         5   15.0
3      Joe    2         5   15.0
4      Joe    4        10   10.0
5      Joe    7         5    NaN
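
For the multi-customer case mentioned above, a merge-based version of the same idea might look like the sketch below. The second customer "Ann" and the names daily, prev_sum, gap and out are invented for illustration. It keeps the same limitation as the map version: only the immediately preceding day with data is counted, so if both of the two previous days have purchases, the earlier one is missed.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Joe"] * 6 + ["Ann"] * 3,   # "Ann" is a made-up second customer
    "day": [1, 1, 2, 2, 4, 7, 1, 3, 4],
    "purchase": [5, 10, 5, 5, 10, 5, 2, 3, 4],
})

# Daily totals, one row per (customer, day), sorted by the group keys.
daily = df.groupby(["customer", "day"], as_index=False)["purchase"].sum()

# Within each customer: total of the previous day with data, and the gap to it.
daily["prev_sum"] = daily.groupby("customer")["purchase"].shift()
daily["gap"] = daily.groupby("customer")["day"].diff()

# Keep the previous total only when that day is at most 2 days back.
daily["final"] = daily["prev_sum"].where(daily["gap"].le(2))

# Merge on both keys instead of mapping on day alone.
out = df.merge(daily[["customer", "day", "final"]],
               on=["customer", "day"], how="left")
```

Grouping the shift and diff by customer stops one customer's totals from leaking into the next customer's first day.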

I ran your code in Postgres to get an idea of what RANGE does and how it differs from ROWS; quite insightful. For window functions, I think SQL has this covered, and easily too.

So, let me know where this falls on its face, and I'll gladly have another go at it.

answered Oct 19 '22 10:10

sammywemmy


Tags: pandas, range, window-functions, google-bigquery

simon
