
Question

I have a dataset whose first three columns are Basket ID (unique identifier), Sale amount (dollars), and the date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:

Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)

Basket  Sale   Date       PrevSale SaleCount MeanToDate MaxToDate
88      $15 3/01/2012                1      
88      $30 11/02/2012      $15      2         $23        $30
88      $16 16/08/2012      $30      3         $20        $30
123     $90 18/06/2012               1      
477     $77 19/08/2012               1      
477     $57 11/12/2012      $77      2         $67        $77
566     $90 6/07/2012                1

I'm pretty new to Python, and I'm really struggling to find a neat way to do this. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. I have no clue how to get MeanToDate and MaxToDate efficiently apart from looping... any ideas?

mtth · Accepted Answer

This should do the trick:

from pandas import concat

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a method for each of the ones you are interested in!
    # (expanding_mean/expanding_count from pandas.stats.moments are no longer
    # available in current pandas; Series.expanding() does the same job)
    return concat(
        {
            'MeanToDate': se.expanding().mean(),  # cumulative mean
            'MaxToDate': se.cummax(),             # cumulative max
            'SaleCount': se.expanding().count(),  # cumulative count
            'Sale': se,                           # simple copy
            'PrevSale': se.shift(1)               # previous sale
        },
        axis=1
    )

# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
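
For a quick check, the same idea can be run on the sample rows from the question. This is only a sketch under a few assumptions that are not stated in the original: a recent pandas where `Series.expanding()` stands in for the old `expanding_*` helpers, sales stored as plain numbers rather than `$` strings, and dates parsed as day/month/year so that sorting is chronological. Selecting `['Date', 'Sale']` before `apply` is also a choice made here to keep the grouping column out of the handler's input.

```python
import pandas as pd

# Sample rows from the question. Assumption: Sale is numeric and
# Date is parsed as day/month/year so it sorts chronologically.
df = pd.DataFrame({
    'Basket': [88, 88, 88, 123, 477, 477, 566],
    'Sale': [15, 30, 16, 90, 77, 57, 90],
    'Date': pd.to_datetime(
        ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
         '19/08/2012', '11/12/2012', '6/07/2012'],
        format='%d/%m/%Y',
    ),
})

def handler(grouped):
    # Time series of sales for a single basket, ordered by date
    se = grouped.set_index('Date')['Sale'].sort_index()
    return pd.concat(
        {
            'PrevSale': se.shift(1),               # previous sale
            'SaleCount': se.expanding().count(),   # cumulative count
            'MeanToDate': se.expanding().mean(),   # cumulative mean
            'MaxToDate': se.cummax(),              # cumulative max
            'Sale': se,                            # simple copy
        },
        axis=1,
    )

# One block per basket, stitched together and flattened back to columns
new_df = df.groupby('Basket')[['Date', 'Sale']].apply(handler).reset_index()
print(new_df)
```

Note that the first row of each basket gets its own sale as MeanToDate and MaxToDate, whereas the table in the question leaves those cells blank; masking them where SaleCount is 1 would reproduce the blanks.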

You can read more about grouping/aggregating in the pandas group-by documentation.
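
Since the question mentions that the data is already sorted by basket and date, the same columns can also be sketched with the group-wise cumulative methods, without `apply`. This is an alternative to the handler approach, not part of the answer above; it assumes numeric sales and day/month/year dates, and it computes the running mean as cumulative sum divided by running count.

```python
import pandas as pd

# Sample rows from the question (assumption: numeric Sale, day/month/year dates)
df = pd.DataFrame({
    'Basket': [88, 88, 88, 123, 477, 477, 566],
    'Sale': [15, 30, 16, 90, 77, 57, 90],
    'Date': pd.to_datetime(
        ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
         '19/08/2012', '11/12/2012', '6/07/2012'],
        format='%d/%m/%Y',
    ),
})

# Make sure rows are ordered by date within each basket
df = df.sort_values(['Basket', 'Date']).reset_index(drop=True)

sales = df.groupby('Basket')['Sale']
df['PrevSale'] = sales.shift(1)               # previous sale in the same basket
df['SaleCount'] = sales.cumcount() + 1        # running count, starting at 1
df['MaxToDate'] = sales.cummax()              # running maximum
df['MeanToDate'] = sales.cumsum() / df['SaleCount']  # running mean

print(df)
```

Each of these calls returns a result aligned with the original rows, so the new columns can be assigned directly and the frame keeps its original shape.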

Python - time series alignment and "to date" functions

Tags:

date

alignment

pandas

time-series

Alessandro Mariani
