Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Equivalent for SQL window function and rows range

Consider the minimal example

customer   day  purchase
Joe        1       5
Joe        1      10
Joe        2       5
Joe        2       5       
Joe        4      10
Joe        7       5

In BigQuery, one would do something similar to this to get how much the customer spent in the last 2 days for every day:

SELECT customer, day
, sum(purchase) OVER (PARTITION BY customer ORDER BY day ASC RANGE between 2 preceding and 1 preceding)
FROM table

What would be the equivalent in pandas? i.e., expected outcome

customer   day  purchase    amount_last_2d
Joe        1       5             null  -- spent days [-,-]
Joe        1      10             null  -- spent days [-,-]
Joe        2       5               15  -- spent days [-,1]
Joe        2       5               15  -- spent days [-,1]
Joe        4      10               10  -- spent days [2,3]
Joe        7       5                0  -- spent days [5,6]
like image 627
simon Avatar asked Jan 29 '21 18:01

simon


People also ask

Which Pandas function to display number of rows and columns?

len() method is used to get the number of rows and number of columns individually.

What is window function in Pandas?

Advertisements. For working on numerical data, Pandas provide few variants like rolling, expanding and exponentially moving weights for window statistics. Among these are sum, mean, median, variance, covariance, correlation, etc.

What is the pandas equivalent of windowing and partitioning in SQL?

This post assumes you are familiar with Windowing and Partitioning functions. Let’s translate the most common SQL Analytical function equivalents to Pandas. The Pandas equivalent of row number within each partition with multiple sort by parameters: The Pandas equivalent of average, running average (moving average) window functions:

What are the SQL analytical function equivalents to pandas?

Let’s translate the most common SQL Analytical function equivalents to Pandas. The Pandas equivalent of row number within each partition with multiple sort by parameters: The Pandas equivalent of average, running average (moving average) window functions: .transform is a powerful command in Pandas, which I invite you to learn more about here —

How to get the behaviour of row_number in a pandas Dataframe?

This is as close to a SQL like window functionality as it gets in Pandas. Can also just pass in the pandas Rank function instead wrapping it in lambda. df.groupby (by= ['C1']) ['C2'].transform (pd.DataFrame.rank) To get the behaviour of row_number (), you should pass method='first' to the rank function.

What is the difference between range and rows in SQL?

These rows are used in the window function. On the other hand, the RANGE clause logically limits the rows. That means it considers the rows based on their value compared to the current row. Let’s see a practical example of how these two clauses can return different results. I’m sure you’re familiar with window functions and their syntax.


Video Answer


2 Answers

Try groupby with shift then reindex back

df['new'] = df.groupby(['customer','day']).purchase.sum().shift().reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[259]: 
  customer  day  purchase   new
0      Joe    1         5   NaN
1      Joe    1        10   NaN
2      Joe    2        10  15.0
3      Joe    2         5  15.0
4      Joe    4        10  15.0

Update

s = df.groupby(['customer','day']).apply(lambda x : df.loc[df.customer.isin(x['customer'].tolist()) & (df.day.isin(x['day']-1)|df.day.isin(x['day']-2)),'purchase'].sum())
df['new'] = s.reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[271]: 
  customer  day  purchase  new
0      Joe    1         5    0
1      Joe    1        10    0
2      Joe    2         5   15
3      Joe    2         5   15
4      Joe    4        10   10
5      Joe    7         5    0
like image 107
BENY Avatar answered Oct 19 '22 12:10

BENY


Not sure if this is the right way to go, and this is limited since only one customer is provided; if there were different customers, I would use merge instead of map; Note also that there is also an implicit assumption that the days are ordered in ascending already:

Get the purchase sum based on the groupby combination of customer and day and create a mapping between day and the sum:

sum_purchase = (df.groupby(["customer", "day"])
                 .purchase
                 .sum()
                 .shift()
                 .droplevel(0))

Again, for multiple customers, I would not drop the customer index, and instead use a merge below:

Get a mapping of the days with the difference between the days:

diff_2_days = (df.drop_duplicates("day")[["day"]]
                 .set_index("day", drop=False)
                 .diff()
                 .day)

Create the new column by mapping the above values to the day column, then use np.where to get columns where the diff is less than or equal to 2:

(
    df.assign(
        diff_2_days = df.day.map(diff_2_days),
        sum_purchase = df.day.map(sum_purchase),
        final=lambda df: np.where(df.diff_2_days.le(2), 
                                  df.sum_purchase, 
                                  np.nan))
      .drop(columns=["sum_purchase", "diff_2_days"])
)


    customer    day     purchase    final
0       Joe     1             5     NaN
1       Joe     1            10     NaN
2       Joe     2             5     15.0
3       Joe     2             5     15.0
4       Joe     4            10     10.0
5       Joe     7             5     NaN

Ran your code in postgres to get an idea of what range does and how it differs from rows; quite insightful. I think for windows functions, SQL got this covered and easily too.

SO, let me know where this falls on its face, and I'll gladly have a rejig at it.

like image 2
sammywemmy Avatar answered Oct 19 '22 10:10

sammywemmy