I have a dataset with an ID column, a date column and a value column. I would like to count consecutive appearances (duplicate values) of an ID over a continuous date range.
My question is very much like Count consecutive duplicate values by group, but in Python. It also differs from How to find duplicates in pandas dataframe because the count must be based on two columns, one of which is not identical: the date (which changes, but should be counted when it is consecutive).
Here is a sample dataset:
ID tDate value
79 2019-06-21 00:00:00 397
79 2019-07-13 00:00:00 404
79 2019-07-18 00:00:00 405
79 2019-07-19 00:00:00 406
79 2019-08-02 00:00:00 410
79 2019-08-09 00:00:00 413
I would like the resultant dataset to be:
ID tDate val consec_count
79 2019-06-21 00:00:00 397 0
79 2019-07-13 00:00:00 404 0
79 2019-07-18 00:00:00 405 1
79 2019-07-19 00:00:00 406 2
79 2019-08-02 00:00:00 410 0
79 2019-08-09 00:00:00 413 0
I marked the "singles" with 0 and not 1 because I need to tell the two apart. I will deal with batches of 'duplicates' differently than single records.
Thank you!
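To illustrate what I mean by treating the two cases differently: once a consec_count column like the one above exists, the split could look like this (just a sketch using the desired output table from above):

```python
import pandas as pd

# Desired output from the question, reconstructed as a DataFrame
df = pd.DataFrame({
    'ID': [79] * 6,
    'tDate': pd.to_datetime(['2019-06-21', '2019-07-13', '2019-07-18',
                             '2019-07-19', '2019-08-02', '2019-08-09']),
    'val': [397, 404, 405, 406, 410, 413],
    'consec_count': [0, 0, 1, 2, 0, 0],
})

# Single records (consec_count == 0) vs. batches of consecutive-day rows
singles = df[df['consec_count'].eq(0)]
batches = df[df['consec_count'].gt(0)]
```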
Sample:
df = pd.DataFrame({'ID': [79, 79, 79, 79, 79, 79, 80, 80, 80, 80, 80, 80, 80],
'tDate': [pd.Timestamp('2019-07-12 00:00:00'),
pd.Timestamp('2019-07-13 00:00:00'),
pd.Timestamp('2019-07-18 00:00:00'),
pd.Timestamp('2019-07-19 00:00:00'),
pd.Timestamp('2019-07-20 00:00:00'),
pd.Timestamp('2019-08-03 00:00:00'),
pd.Timestamp('2019-06-21 00:00:00'),
pd.Timestamp('2019-06-22 00:00:00'),
pd.Timestamp('2019-07-18 00:00:00'),
pd.Timestamp('2019-07-19 00:00:00'),
pd.Timestamp('2019-07-26 00:00:00'),
pd.Timestamp('2019-08-02 00:00:00'),
pd.Timestamp('2019-08-03 00:00:00')],
'value':[397, 404, 405, 406, 408, 413, 397, 404, 405, 406, 408, 410, 413]})
print (df)
ID tDate value
0 79 2019-07-12 397
1 79 2019-07-13 404
2 79 2019-07-18 405
3 79 2019-07-19 406
4 79 2019-07-20 408
5 79 2019-08-03 413
6 80 2019-06-21 397
7 80 2019-06-22 404
8 80 2019-07-18 405
9 80 2019-07-19 406
10 80 2019-07-26 408
11 80 2019-08-02 410
12 80 2019-08-03 413
Solution:
import numpy as np
import pandas as pd

a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))
s = (~a).cumsum()
df['consec_count'] = np.where(a.groupby(s).transform('any'),
                              df.groupby(s).cumcount().add(1), 0)
print (df)
ID tDate value consec_count
0 79 2019-07-12 397 1
1 79 2019-07-13 404 2
2 79 2019-07-18 405 1
3 79 2019-07-19 406 2
4 79 2019-07-20 408 3
5 79 2019-08-03 413 0
6 80 2019-06-21 397 1
7 80 2019-06-22 404 2
8 80 2019-07-18 405 1
9 80 2019-07-19 406 2
10 80 2019-07-26 408 0
11 80 2019-08-02 410 1
12 80 2019-08-03 413 2
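If the same logic is needed for several datasets, the solution can be packaged into a small helper (the function name add_consec_count is my own; the logic is exactly the solution above):

```python
import numpy as np
import pandas as pd

def add_consec_count(df, id_col='ID', date_col='tDate'):
    """Return a copy with a consec_count column: 0 for isolated rows,
    a running 1..n counter within each run of consecutive calendar days
    per id. Assumes df is sorted by id and date."""
    out = df.copy()
    # True where the row is exactly one day after the previous row of the same id
    a = out.groupby(id_col)[date_col].diff().eq(pd.Timedelta(1, unit='d'))
    # Run identifier: increments whenever the one-day chain breaks
    s = (~a).cumsum()
    # A run counts only if it has at least two members (some True in the group)
    mask = a.groupby(s).transform('any')
    out['consec_count'] = np.where(mask, out.groupby(s).cumcount().add(1), 0)
    return out
```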
Explanation:
First, create a mask by comparing the per-group difference from DataFrameGroupBy.diff with one day:
print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))))
ID tDate value diff a
0 79 2019-07-12 397 NaT False
1 79 2019-07-13 404 1 days True
2 79 2019-07-18 405 5 days False
3 79 2019-07-19 406 1 days True
4 79 2019-07-20 408 1 days True
5 79 2019-08-03 413 14 days False
6 80 2019-06-21 397 NaT False
7 80 2019-06-22 404 1 days True
8 80 2019-07-18 405 26 days False
9 80 2019-07-19 406 1 days True
10 80 2019-07-26 408 7 days False
11 80 2019-08-02 410 7 days False
12 80 2019-08-03 413 1 days True
Then create unique group identifiers with Series.cumsum on the inverted mask (~a):
print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
a_neg = ~a,
s = (~a).cumsum()))
ID tDate value diff a a_neg s
0 79 2019-07-12 397 NaT False True 1
1 79 2019-07-13 404 1 days True False 1
2 79 2019-07-18 405 5 days False True 2
3 79 2019-07-19 406 1 days True False 2
4 79 2019-07-20 408 1 days True False 2
5 79 2019-08-03 413 14 days False True 3
6 80 2019-06-21 397 NaT False True 4
7 80 2019-06-22 404 1 days True False 4
8 80 2019-07-18 405 26 days False True 5
9 80 2019-07-19 406 1 days True False 5
10 80 2019-07-26 408 7 days False True 6
11 80 2019-08-02 410 7 days False True 7
12 80 2019-08-03 413 1 days True False 7
Next, create a mask with GroupBy.transform and any to test whether each group s contains at least one True; if it does, every value of that group is set to True:
print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
a_neg = ~a,
s = (~a).cumsum(),
mask = a.groupby(s).transform('any')))
ID tDate value consec_count diff a a_neg s mask
0 79 2019-07-12 397 1 NaT False True 1 True
1 79 2019-07-13 404 2 1 days True False 1 True
2 79 2019-07-18 405 1 5 days False True 2 True
3 79 2019-07-19 406 2 1 days True False 2 True
4 79 2019-07-20 408 3 1 days True False 2 True
5 79 2019-08-03 413 0 14 days False True 3 False
6 80 2019-06-21 397 1 NaT False True 4 True
7 80 2019-06-22 404 2 1 days True False 4 True
8 80 2019-07-18 405 1 26 days False True 5 True
9 80 2019-07-19 406 2 1 days True False 5 True
10 80 2019-07-26 408 0 7 days False True 6 False
11 80 2019-08-02 410 1 7 days False True 7 True
12 80 2019-08-03 413 2 1 days True False 7 True
Create a counter per group s with GroupBy.cumcount:
print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
a_neg = ~a,
s = (~a).cumsum(),
mask = a.groupby(s).transform('any'),
c = df.groupby(s).cumcount().add(1)))
ID tDate value consec_count diff a a_neg s mask c
0 79 2019-07-12 397 1 NaT False True 1 True 1
1 79 2019-07-13 404 2 1 days True False 1 True 2
2 79 2019-07-18 405 1 5 days False True 2 True 1
3 79 2019-07-19 406 2 1 days True False 2 True 2
4 79 2019-07-20 408 3 1 days True False 2 True 3
5 79 2019-08-03 413 0 14 days False True 3 False 1
6 80 2019-06-21 397 1 NaT False True 4 True 1
7 80 2019-06-22 404 2 1 days True False 4 True 2
8 80 2019-07-18 405 1 26 days False True 5 True 1
9 80 2019-07-19 406 2 1 days True False 5 True 2
10 80 2019-07-26 408 0 7 days False True 6 False 1
11 80 2019-08-02 410 1 7 days False True 7 True 1
12 80 2019-08-03 413 2 1 days True False 7 True 2
And last, set 0 where the mask is False with numpy.where:
print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
a_neg = ~a,
s = (~a).cumsum(),
mask = a.groupby(s).transform('any'),
c = df.groupby(s).cumcount().add(1),
out = np.where(mask, df.groupby(s).cumcount().add(1), 0)))
ID tDate value consec_count diff a a_neg s mask c out
0 79 2019-07-12 397 1 NaT False True 1 True 1 1
1 79 2019-07-13 404 2 1 days True False 1 True 2 2
2 79 2019-07-18 405 1 5 days False True 2 True 1 1
3 79 2019-07-19 406 2 1 days True False 2 True 2 2
4 79 2019-07-20 408 3 1 days True False 2 True 3 3
5 79 2019-08-03 413 0 14 days False True 3 False 1 0
6 80 2019-06-21 397 1 NaT False True 4 True 1 1
7 80 2019-06-22 404 2 1 days True False 4 True 2 2
8 80 2019-07-18 405 1 26 days False True 5 True 1 1
9 80 2019-07-19 406 2 1 days True False 5 True 2 2
10 80 2019-07-26 408 0 7 days False True 6 False 1 0
11 80 2019-08-02 410 1 7 days False True 7 True 1 1
12 80 2019-08-03 413 2 1 days True False 7 True 2 2
You may also try creating a mask on a groupby of ID and or-ing it with its shift(-1) so that every row of a consecutive run is flagged True; assign this mask to s1. Finally, use np.where on s1 together with s1.groupby(df.ID).cumsum():
import numpy as np
import pandas as pd

s = df.groupby('ID').tDate.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['consec_count'] = np.where(s1, s1.groupby(df.ID).cumsum(), 0)
Out[185]:
ID tDate value consec_count
0 79 2019-06-21 397 0
1 79 2019-07-13 404 0
2 79 2019-07-18 405 1
3 79 2019-07-19 406 2
4 79 2019-08-02 410 0
5 79 2019-08-09 413 0
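One caveat worth noting (my own observation, not part of the answer): cumsum here runs over the whole ID group, so the counter does not reset between separate runs within the same ID. On data with two runs per ID the result differs from the first answer, which restarts at 1 for each run:

```python
import numpy as np
import pandas as pd

# One ID with two separate runs: 07-12..07-13 and 07-18..07-20
df = pd.DataFrame({
    'ID': [79] * 5,
    'tDate': pd.to_datetime(['2019-07-12', '2019-07-13',
                             '2019-07-18', '2019-07-19', '2019-07-20']),
    'value': [397, 404, 405, 406, 408],
})
s = df.groupby('ID').tDate.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['consec_count'] = np.where(s1, s1.groupby(df.ID).cumsum(), 0)
# The second run continues counting instead of restarting:
print(df['consec_count'].tolist())  # [1, 2, 3, 4, 5]
```

If per-run counting is wanted, grouping the cumsum by a run identifier (e.g. `(~s).cumsum()`) in addition to ID would restart the counter, as the first answer does.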