Using Pandas .diff() on a time series column with a groupby

Question

I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I'd like to add a column showing, for each purchase, how much time has passed since that customer's previous purchase. I'm not sure where the differences are coming from, but they are much too large (even if they are in seconds).

CSV:

Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015

Python:

import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
                         .diff()
                         .fillna('-')
                       )
print(data)

Output:

    Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15    3283200000000000

Desired Output:

   Customer Id Purchase Date  Purchase Difference
3         2322    2015-01-01                  -
5         2322    2015-02-01              31 days
4         2322    2015-03-01              28 days
0         4543    2015-01-01                  -
1         4543    2015-02-05              35 days
2         4543    2015-03-15              38 days

Alexander · Accepted Answer

You can apply diff to the Purchase Date column once it has been converted to datetime. Convert before sorting, so that the rows are ordered chronologically rather than as strings.

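df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
# Gap between consecutive purchases per customer (NaT on each customer's first row)
gaps = df.groupby('Customer Id', sort=False)['Purchase Date'].diff()
# Render as text; blank for a first purchase or a same-day repeat (such as row 6)
df['Purchase Difference'] = [
    '' if pd.isnull(n) or n.days == 0
    else '{} day{}'.format(n.days, '' if n.days == 1 else 's')
    for n in gaps]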

>>> df
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                    
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                    
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
6         4543    2015-03-15
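
If a number is more useful than a string, the gap can also be kept numeric. The following is a minimal sketch rather than part of the original answer: it rebuilds the question's sample data in memory, and the column name 'Days Since Last' and the explicit format='%m/%d/%Y' are assumptions made for illustration.

```python
import io
import pandas as pd

# Sample data from the question, held in memory so the sketch is self-contained.
csv_text = u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""

df = pd.read_csv(io.StringIO(csv_text))
# Convert to datetime first so the sort below is chronological, not alphabetical.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'], format='%m/%d/%Y')
df = df.sort_values(['Customer Id', 'Purchase Date'])

# Timedelta between consecutive purchases within each customer.
gaps = df.groupby('Customer Id')['Purchase Date'].diff()

# .dt.days extracts whole days; each customer's first purchase stays missing (NaN).
df['Days Since Last'] = gaps.dt.days
print(df)
```

Keeping the column numeric means it can still be averaged, filtered, or plotted, which is no longer possible once it has been turned into text.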

jezrael · Answer

You can pass the parse_dates parameter to read_csv to parse the dates while loading, then sort with sort_values, and finally use groupby with diff:

import pandas as pd
import io

temp=u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""
# after testing, replace io.StringIO(temp) with the filename
data = pd.read_csv(io.StringIO(temp), parse_dates=['Purchase Date'])

data.sort_values(by=['Customer Id', 'Purchase Date'], inplace=True)

data['Purchase Difference'] = data.groupby(['Customer Id'])['Purchase Date'].diff()
print(data)
   Customer Id Purchase Date  Purchase Difference
3         2322    2015-01-01                  NaT
5         2322    2015-02-01              31 days
4         2322    2015-03-01              28 days
0         4543    2015-01-01                  NaT
1         4543    2015-02-05              35 days
2         4543    2015-03-15              38 days
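
To get the '-' placeholder from the desired output, one option is to format the timedeltas as text only at the very end. The large numbers in the question are the same gaps expressed in nanoseconds (2678400000000000 ns is 31 days); they most likely appeared because fillna('-') was applied to the timedelta column itself. The sketch below is an illustration under those assumptions, and describe_gap is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Sample data from the question, built directly so the sketch is self-contained.
purchases = pd.DataFrame({
    'Customer Id': [4543, 4543, 4543, 2322, 2322, 2322],
    'Purchase Date': pd.to_datetime(
        ['1/1/2015', '2/5/2015', '3/15/2015',
         '1/1/2015', '3/1/2015', '2/1/2015'],
        format='%m/%d/%Y'),
})
purchases = purchases.sort_values(['Customer Id', 'Purchase Date'])

# Timedelta between consecutive purchases within each customer.
gaps = purchases.groupby('Customer Id')['Purchase Date'].diff()


def describe_gap(gap):
    """Hypothetical helper: '-' for a missing gap, otherwise 'N days'."""
    if pd.isnull(gap):
        return '-'
    return '{} days'.format(gap.days)


# Build the text column from the timedeltas instead of calling fillna('-') on them.
purchases['Purchase Difference'] = [describe_gap(g) for g in gaps]
print(purchases)

# Sanity check on the question's output: the large integer is a nanosecond count.
ns_as_days = pd.Timedelta(2678400000000000, unit='ns').days
print(ns_as_days)
```

Formatting last keeps the intermediate gaps series as real timedeltas, so it remains available for arithmetic if needed.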

Tags: python, pandas, python-2.7

user2242044
