I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I'd like to add a column that shows, for each purchase, how much time has passed since that customer's previous purchase. I'm not sure where it's getting the differences from, but they are much too large (even if they are in seconds).
CSV:
Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
Python:
import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
.diff()
.fillna('-')
)
print(data)
Output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15    3283200000000000
Desired Output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                   -
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
You can just apply diff to the Purchase Date column once it has been converted to a Timestamp.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
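# Blank for a customer's first purchase (NaT) or a same-day repeat;
# otherwise the day count, pluralised only when it is not exactly one.
df['Purchase Difference'] = [
    '' if pd.isnull(n) or n.days == 0
    else str(n.days) + (' day' if n.days == 1 else ' days')
    for n in df.groupby('Customer Id', sort=False)['Purchase Date'].diff()]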
>>> df
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
6         4543    2015-03-15
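One detail worth spelling out is why the conversion comes before the sort here. While the column still holds strings, sort_values compares them character by character, which does not generally match calendar order. A minimal sketch with made-up dates (the values in raw are hypothetical and chosen only so that the two orderings differ):

```python
import pandas as pd

# Hypothetical dates, picked so that string order and date order disagree.
raw = pd.Series(['2/1/2015', '10/1/2015', '1/1/2015'])

# Sorting the raw strings compares them character by character.
as_strings = raw.sort_values().tolist()

# Sorting after conversion compares the actual dates.
as_dates = pd.to_datetime(raw, format='%m/%d/%Y').sort_values()
as_dates_list = as_dates.dt.strftime('%Y-%m-%d').tolist()

print(as_strings)     # October lands before February
print(as_dates_list)  # calendar order
```

With the sample data in the question the two orderings happen to agree, so the problem would only show up on other dates.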
I think you can pass the parse_dates parameter to read_csv to parse the dates, then use sort_values, and finally groupby with diff:
import pandas as pd
import io
temp=u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""
# after testing, replace io.StringIO(temp) with the filename
data = pd.read_csv(io.StringIO(temp), parse_dates=['Purchase Date'])
data.sort_values(by=['Customer Id', 'Purchase Date'], inplace=True)
data['Purchase Difference'] = data.groupby(['Customer Id'])['Purchase Date'].diff()
print(data)
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                 NaT
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                 NaT
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
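As a follow-up, here is a small sketch of two related points. First, the large integers in the question are the same differences expressed in nanoseconds (2678400000000000 ns is exactly 31 days), which suggests, though I have not verified it against that pandas version, that the .fillna('-') step pushed the column out of its timedelta dtype. Second, if a plain number of days is more convenient than a Timedelta, the .dt.days accessor can be applied to the diff. The column name 'Days Since Last Purchase' and the variable names are my own, not from the question:

```python
import io
import pandas as pd

csv_text = u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""

# One of the integers from the question, read back as nanoseconds.
from_question = pd.Timedelta(2678400000000000, unit='ns')
print(from_question)  # a Timedelta of 31 days

data = pd.read_csv(io.StringIO(csv_text), parse_dates=['Purchase Date'])
data = data.sort_values(by=['Customer Id', 'Purchase Date'])

# Per-customer gap between consecutive purchases (NaT for the first one).
gaps = data.groupby('Customer Id')['Purchase Date'].diff()

# Whole days as numbers; the first purchase of each customer becomes NaN.
data['Days Since Last Purchase'] = gaps.dt.days
print(data)
```

Leaving the missing values as NaT or NaN, rather than filling them with a string such as '-', keeps the column in a single dtype that can still be used in arithmetic.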