
Counting Consecutive Duplicates For By Group

I have a dataset with an id column, a date column, and a value column. I would like to count the consecutive appearances of each id over a continuous date range.

My question is very much like Count consecutive duplicate values by group, but in Python. It also differs from How to find duplicates in pandas dataframe because the count has to be based on two columns, one of which is not identical across rows: the date changes, but I want rows to count toward a run when their dates are consecutive.

Here is a sample dataset:

ID      tDate            value
79  2019-06-21 00:00:00  397
79  2019-07-13 00:00:00  404
79  2019-07-18 00:00:00  405
79  2019-07-19 00:00:00  406
79  2019-08-02 00:00:00  410
79  2019-08-09 00:00:00  413

I would like the resultant dataset to be:

ID      tDate            val  consec_count
79  2019-06-21 00:00:00  397  0
79  2019-07-13 00:00:00  404  0
79  2019-07-18 00:00:00  405  1
79  2019-07-19 00:00:00  406  2
79  2019-08-02 00:00:00  410  0
79  2019-08-09 00:00:00  413  0

I marked the "singles" with 0 rather than 1 because I need to tell the two apart: I will handle batches of 'duplicates' differently from single records.

Thank you!

asked Sep 23 '19 by sa_zy

2 Answers

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [79, 79, 79, 79, 79, 79, 80, 80, 80, 80, 80, 80, 80], 
                   'tDate': [pd.Timestamp('2019-07-12 00:00:00'),
                             pd.Timestamp('2019-07-13 00:00:00'),
                             pd.Timestamp('2019-07-18 00:00:00'),
                             pd.Timestamp('2019-07-19 00:00:00'),
                             pd.Timestamp('2019-07-20 00:00:00'),
                             pd.Timestamp('2019-08-03 00:00:00'), 
                             pd.Timestamp('2019-06-21 00:00:00'), 
                             pd.Timestamp('2019-06-22 00:00:00'), 
                             pd.Timestamp('2019-07-18 00:00:00'), 
                             pd.Timestamp('2019-07-19 00:00:00'), 
                             pd.Timestamp('2019-07-26 00:00:00'), 
                             pd.Timestamp('2019-08-02 00:00:00'), 
                             pd.Timestamp('2019-08-03 00:00:00')],
                   'value':[397, 404, 405, 406, 408, 413, 397, 404, 405, 406, 408, 410, 413]})

print (df)
    ID      tDate  value
0   79 2019-07-12    397
1   79 2019-07-13    404
2   79 2019-07-18    405
3   79 2019-07-19    406
4   79 2019-07-20    408
5   79 2019-08-03    413
6   80 2019-06-21    397
7   80 2019-06-22    404
8   80 2019-07-18    405
9   80 2019-07-19    406
10  80 2019-07-26    408
11  80 2019-08-02    410
12  80 2019-08-03    413

Solution:

a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))
s = (~a).cumsum()
df['consec_count'] = np.where(a.groupby(s).transform('any'),
                              df.groupby(s).cumcount(1).add(1), 0)

print (df)
    ID      tDate  value  consec_count
0   79 2019-07-12    397             1
1   79 2019-07-13    404             2
2   79 2019-07-18    405             1
3   79 2019-07-19    406             2
4   79 2019-07-20    408             3
5   79 2019-08-03    413             0
6   80 2019-06-21    397             1
7   80 2019-06-22    404             2
8   80 2019-07-18    405             1
9   80 2019-07-19    406             2
10  80 2019-07-26    408             0
11  80 2019-08-02    410             1
12  80 2019-08-03    413             2
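For reference, the same three lines run on the data from the question reproduce the expected consec_count there as well, with singles marked 0 (a minimal self-contained check, rebuilding the question's frame):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [79] * 6,
                   'tDate': pd.to_datetime(['2019-06-21', '2019-07-13', '2019-07-18',
                                            '2019-07-19', '2019-08-02', '2019-08-09']),
                   'value': [397, 404, 405, 406, 410, 413]})

# mask of one-day steps per ID, run labels, then counter only inside real runs
a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))
s = (~a).cumsum()
df['consec_count'] = np.where(a.groupby(s).transform('any'),
                              df.groupby(s).cumcount().add(1), 0)
print(df['consec_count'].tolist())  # [0, 0, 1, 2, 0, 0]
```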

Explanation:

First, create a mask by comparing the per-group difference from DataFrameGroupBy.diff against one day:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))))
    ID      tDate  value    diff      a
0   79 2019-07-12    397     NaT  False
1   79 2019-07-13    404  1 days   True
2   79 2019-07-18    405  5 days  False
3   79 2019-07-19    406  1 days   True
4   79 2019-07-20    408  1 days   True
5   79 2019-08-03    413 14 days  False
6   80 2019-06-21    397     NaT  False
7   80 2019-06-22    404  1 days   True
8   80 2019-07-18    405 26 days  False
9   80 2019-07-19    406  1 days   True
10  80 2019-07-26    408  7 days  False
11  80 2019-08-02    410  7 days  False
12  80 2019-08-03    413  1 days   True

Create unique group labels with Series.cumsum on the condition inverted by ~:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum()))

    ID      tDate  value    diff      a  a_neg  s
0   79 2019-07-12    397     NaT  False   True  1
1   79 2019-07-13    404  1 days   True  False  1
2   79 2019-07-18    405  5 days  False   True  2
3   79 2019-07-19    406  1 days   True  False  2
4   79 2019-07-20    408  1 days   True  False  2
5   79 2019-08-03    413 14 days  False   True  3
6   80 2019-06-21    397     NaT  False   True  4
7   80 2019-06-22    404  1 days   True  False  4
8   80 2019-07-18    405 26 days  False   True  5
9   80 2019-07-19    406  1 days   True  False  5
10  80 2019-07-26    408  7 days  False   True  6
11  80 2019-08-02    410  7 days  False   True  7
12  80 2019-08-03    413  1 days   True  False  7
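The `(~a).cumsum()` step is a general trick for labelling runs in a boolean Series: every `False` (a break in the run) increments the counter, so all rows of one run share the same label. A minimal standalone sketch:

```python
import pandas as pd

# True marks "this row continues the previous one"
a = pd.Series([False, True, False, True, True, False])
s = (~a).cumsum()  # a new label starts at every break
print(s.tolist())  # [1, 1, 2, 2, 2, 3]
```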

Create a mask with GroupBy.transform and any to test whether each group contains at least one True; if it does, all values of that group are set to True:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum(),
                 mask = a.groupby(s).transform('any')))

    ID      tDate  value  consec_count    diff      a  a_neg  s   mask
0   79 2019-07-12    397             1     NaT  False   True  1   True
1   79 2019-07-13    404             2  1 days   True  False  1   True
2   79 2019-07-18    405             1  5 days  False   True  2   True
3   79 2019-07-19    406             2  1 days   True  False  2   True
4   79 2019-07-20    408             3  1 days   True  False  2   True
5   79 2019-08-03    413             0 14 days  False   True  3  False
6   80 2019-06-21    397             1     NaT  False   True  4   True
7   80 2019-06-22    404             2  1 days   True  False  4   True
8   80 2019-07-18    405             1 26 days  False   True  5   True
9   80 2019-07-19    406             2  1 days   True  False  5   True
10  80 2019-07-26    408             0  7 days  False   True  6  False
11  80 2019-08-02    410             1  7 days  False   True  7   True
12  80 2019-08-03    413             2  1 days   True  False  7   True
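`transform('any')` broadcasts the per-group reduction back to the original index, so every row learns whether its run contains at least one one-day step. A small sketch of that broadcast with toy data:

```python
import pandas as pd

a = pd.Series([False, True, False, False])  # one-day-step mask
s = pd.Series([1, 1, 2, 3])                 # run labels
mask = a.groupby(s).transform('any')        # group result repeated per row
print(mask.tolist())  # [True, True, False, False]
```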

Create a counter per group s with GroupBy.cumcount:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum(),
                 mask = a.groupby(s).transform('any'),
                 c = df.groupby(s).cumcount(1).add(1)))

    ID      tDate  value  consec_count    diff      a  a_neg  s   mask  c
0   79 2019-07-12    397             1     NaT  False   True  1   True  1
1   79 2019-07-13    404             2  1 days   True  False  1   True  2
2   79 2019-07-18    405             1  5 days  False   True  2   True  1
3   79 2019-07-19    406             2  1 days   True  False  2   True  2
4   79 2019-07-20    408             3  1 days   True  False  2   True  3
5   79 2019-08-03    413             0 14 days  False   True  3  False  1
6   80 2019-06-21    397             1     NaT  False   True  4   True  1
7   80 2019-06-22    404             2  1 days   True  False  4   True  2
8   80 2019-07-18    405             1 26 days  False   True  5   True  1
9   80 2019-07-19    406             2  1 days   True  False  5   True  2
10  80 2019-07-26    408             0  7 days  False   True  6  False  1
11  80 2019-08-02    410             1  7 days  False   True  7   True  1
12  80 2019-08-03    413             2  1 days   True  False  7   True  2

And last, set 0 where mask is False, with numpy.where:

print (df.assign(diff= df.groupby('ID')['tDate'].diff(),
                 a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg = ~a,
                 s = (~a).cumsum(),
                 mask = a.groupby(s).transform('any'),
                 c = df.groupby(s).cumcount(1).add(1),
                 out =  np.where(mask, df.groupby(s).cumcount(1).add(1), 0)))

    ID      tDate  value  consec_count    diff      a  a_neg  s   mask  c  out
0   79 2019-07-12    397             1     NaT  False   True  1   True  1    1
1   79 2019-07-13    404             2  1 days   True  False  1   True  2    2
2   79 2019-07-18    405             1  5 days  False   True  2   True  1    1
3   79 2019-07-19    406             2  1 days   True  False  2   True  2    2
4   79 2019-07-20    408             3  1 days   True  False  2   True  3    3
5   79 2019-08-03    413             0 14 days  False   True  3  False  1    0
6   80 2019-06-21    397             1     NaT  False   True  4   True  1    1
7   80 2019-06-22    404             2  1 days   True  False  4   True  2    2
8   80 2019-07-18    405             1 26 days  False   True  5   True  1    1
9   80 2019-07-19    406             2  1 days   True  False  5   True  2    2
10  80 2019-07-26    408             0  7 days  False   True  6  False  1    0
11  80 2019-08-02    410             1  7 days  False   True  7   True  1    1
12  80 2019-08-03    413             2  1 days   True  False  7   True  2    2
answered Oct 15 '22 by jezrael

You may also create a mask on a groupby of ID and OR it with its shift(-1), so that every row of a consecutive run is flagged True; assign that to mask s1. Finally, use np.where on s1 and a grouped cumsum of s1:

s = df.groupby('ID').tDate.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['consec_count'] = np.where(s1, s1.groupby(df.ID).cumsum(), 0)

Out[185]:
   ID      tDate  value  consec_count
0  79 2019-06-21    397             0
1  79 2019-07-13    404             0
2  79 2019-07-18    405             1
3  79 2019-07-19    406             2
4  79 2019-08-02    410             0
5  79 2019-08-09    413             0
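One caveat, sketched below with the multi-run ID 79 data from the first answer: because the cumsum here is grouped only by ID, the counter keeps accumulating across separate runs within the same ID instead of restarting at each run, so this variant fits data where each ID has at most one consecutive run.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [79] * 6,
                   'tDate': pd.to_datetime(['2019-07-12', '2019-07-13', '2019-07-18',
                                            '2019-07-19', '2019-07-20', '2019-08-03']),
                   'value': [397, 404, 405, 406, 408, 413]})

s = df.groupby('ID').tDate.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['consec_count'] = np.where(s1, s1.groupby(df.ID).cumsum(), 0)

# The two runs (07-12..13 and 07-18..20) share one counter:
print(df['consec_count'].tolist())  # [1, 2, 3, 4, 5, 0], not [1, 2, 1, 2, 3, 0]
```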
answered Oct 15 '22 by Andy L.