I have a dataframe:
ID DATE WIN
A 2015/6/5 Yes
A 2015/6/7 Yes
A 2015/6/7 Yes
A 2015/6/7 Yes
B 2015/6/8 No
B 2015/8/7 Yes
C 2015/5/15 Yes
C 2015/5/30 No
C 2015/7/30 No
C 2015/8/03 Yes
I want to add columns that count, for each ID, the number of wins within the past month (30 days) and the number of days since the last win, so the result will look like:
ID DATE WIN NumOfDaysSinceLastWin NumOfWinsInThePast30days
A 2015/6/5 Yes 0 0
A 2015/6/7 Yes 2 1
A 2015/6/7 Yes 2 1 or (A 2015/6/7 Yes 0 2)
A 2015/6/8 No 1 3
B 2015/8/7 No 0 0
B 2015/8/7 Yes 0 0
C 2015/5/15 Yes 0 0
C 2015/5/30 No 15 1
C 2015/7/30 No 76 0
C 2015/8/03 Yes 80 0
How can I use the groupby function and TimeGrouper to get this?
The input data has to be sorted by DATE within each group; in this data that is already the case.
The original input does not cover every situation very well, so four extra rows are added (see the modified dataset temp2 at the end).
A column WIN1 is created from WIN, with value 1 for 'Yes' and 0 for 'No'. It is needed for both output columns.
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
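An equivalent, slightly more idiomatic way to build the indicator is a boolean comparison; this is just a sketch, the rest of the walkthrough keeps the map version above.
# True/False casts cleanly to 1/0
df['WIN1'] = (df['WIN'] == 'Yes').astype(int)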
NumOfDaysSinceLastWin
First, a column cumsum (the cumulative sum of WIN1) is created.
df['cumsum'] = df['WIN1'].cumsum()
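To see why this running total helps (a standalone toy example, not part of the solution): the cumulative sum only increases on 'Yes' rows, so every 'No' row shares its cumsum value with the last preceding 'Yes'.
win1 = pd.Series([1, 1, 0, 0, 1])   # Yes, Yes, No, No, Yes
win1.cumsum().tolist()              # [1, 2, 2, 2, 3] - the two 'No' rows group with the second 'Yes'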
If every WIN were 'Yes', this would be easy: the data could be grouped and the difference between each date and the previous date stored in a column diffs:
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0))
But the 'No' values in column WIN complicate things. For a 'Yes' row the difference has to be taken against the previous 'Yes'; for a 'No' row it has to be taken against the last preceding 'Yes'. The difference could be computed in several ways; here it is obtained by subtracting two columns, DATE and date1.
Column date1
The rows have to be grouped in a special way: each 'No' together with the last preceding 'Yes'. The cumulative sum in column cumsum makes this possible, because it only increases on 'Yes' rows.
The minimum date of each such group is the date of its 'Yes' row, and this value is repeated for the rows with 'No' values.
Column count holds the size of each cumsum group: it is 1 where the cumsum value does not repeat and greater than 1 where it does.
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
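Note that transform, unlike a plain aggregation, broadcasts the per-group result back to every row, which is what allows the minimum date and the group size to be stored as ordinary columns. A tiny illustration on made-up data:
toy = pd.DataFrame({'g': [1, 1, 2], 'x': [5, 7, 9]})
toy.groupby('g')['x'].transform('min')   # -> [5, 5, 9], same length as toy
toy.groupby('g')['x'].min()              # -> one value per group: {1: 5, 2: 9}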
The dates of the preceding 'Yes' rows are needed for the difference. Dataframe df1 filters df down to the 'Yes' rows and is then grouped by column ID. The row indices are unchanged, so the shifted dates can be assigned straight back to a new column of df.
df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift())
print df
ID DATE WIN WIN1 cumsum min count date1
0 A 2015-06-05 Yes 1 1 2015-06-05 1 NaT
1 A 2015-06-05 Yes 1 2 2015-06-05 1 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 1 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 1 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 4 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 4 NaT
6 B 2015-06-07 No 0 5 2015-06-07 4 NaT
7 B 2015-06-07 No 0 5 2015-06-07 4 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 1 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 3 NaT
10 C 2015-05-30 No 0 7 2015-05-15 3 NaT
11 C 2015-07-30 No 0 7 2015-05-15 3 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 1 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 1 2015-08-03
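The date1 assignment above relies on index alignment: df1 keeps the original row labels of df, so the shifted dates land back on the matching rows and every other row becomes NaT. A minimal sketch of the same mechanism on made-up data:
full = pd.DataFrame({'x': range(6)})     # index 0..5
subset = full[full['x'] % 2 == 0]        # filtering keeps the labels 0, 2, 4
full['shifted'] = subset['x'].shift()    # fills rows 0, 2, 4; rows 1, 3, 5 become NaN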
Then the date columns min (for the 'No' rows and the last preceding 'Yes') and date1 (the remaining 'Yes' rows) can be combined, using column count. An extra condition checks that date1 is still null (NaT), so that only the missing values are overwritten by column min.
df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df
ID DATE WIN WIN1 cumsum min count date1
0 A 2015-06-05 Yes 1 1 2015-06-05 1 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 1 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 1 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 1 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 4 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 4 2015-06-07
6 B 2015-06-07 No 0 5 2015-06-07 4 2015-06-07
7 B 2015-06-07 No 0 5 2015-06-07 4 2015-06-07
8 B 2015-08-07 Yes 1 6 2015-08-07 1 2015-08-07
9 C 2015-05-15 Yes 1 7 2015-05-15 3 2015-05-15
10 C 2015-05-30 No 0 7 2015-05-15 3 2015-05-15
11 C 2015-07-30 No 0 7 2015-05-15 3 2015-05-15
12 C 2015-08-03 Yes 1 8 2015-08-03 1 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 1 2015-08-03
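Since count is a group size it is always at least 1, so the condition effectively reduces to "date1 is missing". An equivalent one-line sketch:
# fill the still-missing previous-win dates from the per-group minimum
df['date1'] = df['date1'].fillna(df['min'])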
Repeating datetimes - subsolution
Sorry if this is done in a complicated way; maybe someone will find a better one.
The idea is to find the repeated dates, fill them with the last preceding 'Yes' and put that into column date1 for the difference.
These rows are identified via column count (recomputed below as the number of 'Yes' rows per ID and DATE); the others (value 1) are reset to NaN. Then the values from date1 are copied to date2 wherever count is not null.
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print df
ID DATE WIN WIN1 cumsum min count date1 date2
0 A 2015-06-05 Yes 1 1 2015-06-05 2 2015-06-05 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 2 2015-06-05 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 3 2015-06-05 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 3 2015-06-07 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 3 2015-06-07 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 NaN 2015-06-07 NaT
6 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
7 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 NaN 2015-08-07 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 NaN 2015-05-15 NaT
10 C 2015-05-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
11 C 2015-07-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 2 2015-05-15 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 2 2015-08-03 2015-08-03
Then these values are replaced by each group's minimum and written back into the date1 column.
def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp
df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull(), 'date1'] = df['date2']
print df
ID DATE WIN WIN1 cumsum min count date1 date2
0 A 2015-06-05 Yes 1 1 2015-06-05 2 2015-06-05 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 2 2015-06-05 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 3 2015-06-05 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 3 2015-06-05 2015-06-05
4 A 2015-06-07 Yes 1 5 2015-06-07 3 2015-06-05 2015-06-05
5 A 2015-06-08 No 0 5 2015-06-07 NaN 2015-06-07 NaT
6 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
7 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 NaN 2015-08-07 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 NaN 2015-05-15 NaT
10 C 2015-05-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
11 C 2015-07-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 2 2015-05-15 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 2 2015-05-15 2015-05-15
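As an aside, the repeat_value helper does the same job as a grouped transform; a shorter equivalent sketch (in reasonably recent pandas):
# broadcast each (ID, DATE) group's minimum date2 back to every row of the group
df['date2'] = df.groupby(['ID', 'DATE'])['date2'].transform('min')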
Column NumOfDaysSinceLastWin is filled with the difference between DATE and date1. Its datatype is Timedelta, so it is converted to whole days (an integer). Finally the helper columns that are no longer needed are deleted. (Columns WIN1 and count are still needed for the second output column, so they are kept.)
df['NumOfDaysSinceLastWin'] = ((df['DATE'] - df['date1']).fillna(0)).astype('timedelta64[D]')
df = df.drop(['cumsum','min', 'date1'], axis=1 )
print df
ID DATE WIN WIN1 count NumOfDaysSinceLastWin
0 A 2015-06-05 Yes 1 2 0
1 A 2015-06-05 Yes 1 2 0
2 A 2015-06-07 Yes 1 3 2
3 A 2015-06-07 Yes 1 3 2
4 A 2015-06-07 Yes 1 3 2
5 A 2015-06-08 No 0 NaN 1
6 B 2015-06-07 No 0 NaN 0
7 B 2015-06-07 No 0 NaN 0
8 B 2015-08-07 Yes 1 NaN 0
9 C 2015-05-15 Yes 1 NaN 0
10 C 2015-05-30 No 0 NaN 15
11 C 2015-07-30 No 0 NaN 76
12 C 2015-08-03 Yes 1 2 80
13 C 2015-08-03 Yes 1 2 80
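The conversion above could also be written with the .dt accessor (available since pandas 0.15); this sketch would replace the NumOfDaysSinceLastWin line, before date1 is dropped:
# whole days between DATE and the previous-win date; missing differences become 0
df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).dt.days.fillna(0).astype(int)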
NumOfWinsInThePast30days
A rolling sum is your friend here. Column yes (needed for the resampling) is mapped to 1 for 'Yes' and NaN for 'No', so that the daily count after resampling counts only the wins.
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
Dataframe df2 is a copy of df with column DATE set as the index (for resampling). Unneeded columns are dropped.
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1', 'count'], axis=1)
Then df2 is resampled by day: for each ID, every calendar day gets the number of 'Yes' rows on that day, and days with only 'No' rows (or no rows at all) get 0. (The full printout below makes this clearer.)
df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()
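Note that the how= argument of resample was removed in later pandas releases (0.18+); a roughly equivalent modern sketch of the two lines above would be:
# newer pandas: aggregate after resampling instead of passing how=
df2 = df2.groupby('ID')['yes'].resample('D').count()
df2 = df2.reset_index()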
Dataframe df2 is then grouped by ID and the function rolling_sum is applied within each group.
df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)
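Similarly, pd.rolling_sum was removed in newer pandas; the rolling window API is the replacement (a sketch):
# newer pandas: per-ID rolling sum over a 30-row window (30 days, since df2 now has one row per day)
df2['rollsum'] = df2.groupby('ID')['yes'].transform(lambda s: s.rolling(window=30, min_periods=1).sum())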
For better understanding, all rows of df2 are displayed.
with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    print df2
ID DATE yes rollsum
0 A 2015-06-05 2 2
1 A 2015-06-06 0 2
2 A 2015-06-07 3 5
3 A 2015-06-08 0 5
4 B 2015-06-07 0 0
5 B 2015-06-08 0 0
6 B 2015-06-09 0 0
7 B 2015-06-10 0 0
8 B 2015-06-11 0 0
9 B 2015-06-12 0 0
10 B 2015-06-13 0 0
11 B 2015-06-14 0 0
12 B 2015-06-15 0 0
13 B 2015-06-16 0 0
14 B 2015-06-17 0 0
15 B 2015-06-18 0 0
16 B 2015-06-19 0 0
17 B 2015-06-20 0 0
18 B 2015-06-21 0 0
19 B 2015-06-22 0 0
20 B 2015-06-23 0 0
21 B 2015-06-24 0 0
22 B 2015-06-25 0 0
23 B 2015-06-26 0 0
24 B 2015-06-27 0 0
25 B 2015-06-28 0 0
26 B 2015-06-29 0 0
27 B 2015-06-30 0 0
28 B 2015-07-01 0 0
29 B 2015-07-02 0 0
30 B 2015-07-03 0 0
31 B 2015-07-04 0 0
32 B 2015-07-05 0 0
33 B 2015-07-06 0 0
34 B 2015-07-07 0 0
35 B 2015-07-08 0 0
36 B 2015-07-09 0 0
37 B 2015-07-10 0 0
38 B 2015-07-11 0 0
39 B 2015-07-12 0 0
40 B 2015-07-13 0 0
41 B 2015-07-14 0 0
42 B 2015-07-15 0 0
43 B 2015-07-16 0 0
44 B 2015-07-17 0 0
45 B 2015-07-18 0 0
46 B 2015-07-19 0 0
47 B 2015-07-20 0 0
48 B 2015-07-21 0 0
49 B 2015-07-22 0 0
50 B 2015-07-23 0 0
51 B 2015-07-24 0 0
52 B 2015-07-25 0 0
53 B 2015-07-26 0 0
54 B 2015-07-27 0 0
55 B 2015-07-28 0 0
56 B 2015-07-29 0 0
57 B 2015-07-30 0 0
58 B 2015-07-31 0 0
59 B 2015-08-01 0 0
60 B 2015-08-02 0 0
61 B 2015-08-03 0 0
62 B 2015-08-04 0 0
63 B 2015-08-05 0 0
64 B 2015-08-06 0 0
65 B 2015-08-07 1 1
66 C 2015-05-15 1 1
67 C 2015-05-16 0 1
68 C 2015-05-17 0 1
69 C 2015-05-18 0 1
70 C 2015-05-19 0 1
71 C 2015-05-20 0 1
72 C 2015-05-21 0 1
73 C 2015-05-22 0 1
74 C 2015-05-23 0 1
75 C 2015-05-24 0 1
76 C 2015-05-25 0 1
77 C 2015-05-26 0 1
78 C 2015-05-27 0 1
79 C 2015-05-28 0 1
80 C 2015-05-29 0 1
81 C 2015-05-30 0 1
82 C 2015-05-31 0 1
83 C 2015-06-01 0 1
84 C 2015-06-02 0 1
85 C 2015-06-03 0 1
86 C 2015-06-04 0 1
87 C 2015-06-05 0 1
88 C 2015-06-06 0 1
89 C 2015-06-07 0 1
90 C 2015-06-08 0 1
91 C 2015-06-09 0 1
92 C 2015-06-10 0 1
93 C 2015-06-11 0 1
94 C 2015-06-12 0 1
95 C 2015-06-13 0 1
96 C 2015-06-14 0 0
97 C 2015-06-15 0 0
98 C 2015-06-16 0 0
99 C 2015-06-17 0 0
100 C 2015-06-18 0 0
101 C 2015-06-19 0 0
102 C 2015-06-20 0 0
103 C 2015-06-21 0 0
104 C 2015-06-22 0 0
105 C 2015-06-23 0 0
106 C 2015-06-24 0 0
107 C 2015-06-25 0 0
108 C 2015-06-26 0 0
109 C 2015-06-27 0 0
110 C 2015-06-28 0 0
111 C 2015-06-29 0 0
112 C 2015-06-30 0 0
113 C 2015-07-01 0 0
114 C 2015-07-02 0 0
115 C 2015-07-03 0 0
116 C 2015-07-04 0 0
117 C 2015-07-05 0 0
118 C 2015-07-06 0 0
119 C 2015-07-07 0 0
120 C 2015-07-08 0 0
121 C 2015-07-09 0 0
122 C 2015-07-10 0 0
123 C 2015-07-11 0 0
124 C 2015-07-12 0 0
125 C 2015-07-13 0 0
126 C 2015-07-14 0 0
127 C 2015-07-15 0 0
128 C 2015-07-16 0 0
129 C 2015-07-17 0 0
130 C 2015-07-18 0 0
131 C 2015-07-19 0 0
132 C 2015-07-20 0 0
133 C 2015-07-21 0 0
134 C 2015-07-22 0 0
135 C 2015-07-23 0 0
136 C 2015-07-24 0 0
137 C 2015-07-25 0 0
138 C 2015-07-26 0 0
139 C 2015-07-27 0 0
140 C 2015-07-28 0 0
141 C 2015-07-29 0 0
142 C 2015-07-30 0 0
143 C 2015-07-31 0 0
144 C 2015-08-01 0 0
145 C 2015-08-02 0 0
146 C 2015-08-03 2 2
The unnecessary column yes will be deleted.
df2 = df2.drop(['yes'], axis=1 )
The output is merged back with the original dataframe df.
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print df2
ID DATE WIN WIN1 NumOfDaysSinceLastWin yes rollsum
0 A 2015-06-07 Yes 1 0 1 2
1 A 2015-06-07 Yes 1 0 1 2
2 B 2015-08-07 No 0 0 NaN 1
3 B 2015-08-07 Yes 1 0 1 1
4 C 2015-05-15 Yes 1 0 1 1
5 C 2015-05-30 No 0 15 NaN 1
6 C 2015-07-30 No 0 76 NaN 0
7 C 2015-08-03 Yes 1 80 1 1
Where the values in column count are not null (repeated dates), they are copied into column WIN1.
The rolling sum also counts the 'Yes' rows of the current date itself, so that amount has to be subtracted; it is 1 for an ordinary 'Yes' row and the per-date count for repeated dates, and both are now held in column WIN1.
df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
Delete unnecessary columns.
df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print df2
ID DATE WIN NumOfDaysSinceLastWin NumOfWinsInThePast30days
0 A 2015-06-05 Yes 0 0
1 A 2015-06-05 Yes 0 0
2 A 2015-06-07 Yes 2 2
3 A 2015-06-07 Yes 2 2
4 A 2015-06-07 Yes 2 2
5 A 2015-06-08 No 1 5
6 B 2015-06-07 No 0 0
7 B 2015-06-07 No 0 0
8 B 2015-08-07 Yes 0 0
9 C 2015-05-15 Yes 0 0
10 C 2015-05-30 No 15 1
11 C 2015-07-30 No 76 0
12 C 2015-08-03 Yes 80 0
13 C 2015-08-03 Yes 80 0
And finally, everything together:
import pandas as pd
import numpy as np
import io
#original data
temp=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes"""
#changed repeating data
temp2=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes
C,2015/8/03,Yes"""
df = pd.read_csv(io.StringIO(temp2), parse_dates = [1])
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
df['cumsum'] = df['WIN1'].cumsum()
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0))
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
df1 = df[~df['WIN'].isin(['No'])]
df['date1'] = df1.groupby(['ID'])['DATE'].apply(lambda d: d.shift())
print df
df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print df
#resolve repeating datetimes
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
df.loc[df['count'].notnull() , 'date2'] = df['date1']
print df
def repeat_value(grp):
    grp['date2'] = grp['date2'].min()
    return grp
df = df.groupby(['ID', 'DATE']).apply(repeat_value)
df.loc[df['date2'].notnull() , 'date1'] = df['date2']
print df
df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).astype('timedelta64[D]')
df = df.drop(['cumsum','min','date1', 'date2'], axis=1 )
print df
#NumOfWinsInThePast30days
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1','count'], axis=1)
df2 = df2.groupby('ID').resample("D", how='count')
df2 = df2.reset_index()
df2['rollsum'] = df2.groupby('ID')['yes'].transform(pd.rolling_sum, window=30, min_periods=1)
#with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
#print df2
df2 = df2.drop(['yes'], axis=1 )
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print df2
df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print df2
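For anyone who wants to sanity-check the result, here is a small brute-force sketch, run after the script above (it reuses temp2 and the imports) but independent of its intermediate steps. It reads "past 30 days" as wins on strictly earlier dates at most 30 days back; it is O(n^2) per group but easy to verify by hand, and under that reading it should reproduce the NumOfWinsInThePast30days column shown above.
df_check = pd.read_csv(io.StringIO(temp2), parse_dates=[1])

def past_30_day_wins(group):
    # for every row, count the 'Yes' rows of the same ID on an earlier date within the previous 30 days
    out = []
    for i in range(len(group)):
        current_date = group['DATE'].iloc[i]
        earlier = group.iloc[:i]
        delta = current_date - earlier['DATE']
        mask = (earlier['WIN'] == 'Yes') & (delta > pd.Timedelta(0)) & (delta <= pd.Timedelta(days=30))
        out.append(int(mask.sum()))
    return pd.Series(out, index=group.index)

df_check['NumOfWinsInThePast30days'] = df_check.groupby('ID', group_keys=False).apply(past_30_day_wins)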