Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas count consecutive date observations within groupby object

Tags:

python

pandas

This is an example of the data frame i'm working with:

d = {
'item_number':['bdsm1000', 'bdsm1000', 'bdsm1000', 'ZZRWB18','ZZRWB18', 'ZZRWB18', 'ZZRWB18', 'ZZHP1427BLK', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1414', 'ZZHP1414', 'ZZHP1414', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE'],
'Comp_ID':[2454, 2454, 2454, 1395, 1395, 1395, 1395, 3378, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 43978, 43978, 43978, 1197347907, 70745, 4737, 1197347907, 4737, 1197347907, 70745, 4737, 1197347907, 70745, 4737, 1197347907, 4737, 1197487704, 1197347907, 70745, 23872, 4737, 1197347907, 4737, 1197487704, 1197347907, 23872, 4737, 1197487704, 1197347907, 70745],
'date':['2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19']}

df = pd.DataFrame(data=d)
df.date = pd.to_datetime(df.date)

I'd like to count consecutive observations starting from 2016-11-22 that there are grouped by Comp_ID and item_number.

Essentially, what I am looking to do, is count how many days in a row there is an observation counting back from todays date for each Comp_ID and item_number. (this example was put together on the 22nd of Nov) Consecutive observations observed weeks/ days prior to today are not relevant. Only sequences like today... yesterday... the day before yesterday... and so on are relevant.

I got this to work on a smaller sample, but it seems to be getting tripped up on a larger data-set.

Here is the code for the smaller sample. I need to find the consecutive dates with observations across thousands of sellers/ items. For some reason, the below code did not work on the larger data set.

d = {'item_number':['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'],
'Comp_ID':['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'],
'date':['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']}

df = pd.DataFrame(data=d)
df.date = pd.to_datetime(df.date)
d = pd.Timedelta(1, 'D')

df = df.sort_values(['item_number','date','Comp_ID'],ascending=False)

g = df.groupby(['Comp_ID','item_number'])
sequence = g['date'].apply(lambda x: x.diff().fillna(0).abs().le(d)).reset_index()
sequence.set_index('index',inplace=True)
test = df.join(sequence)
test.columns = ['Comp_ID','date','item_number','consecutive']
g = test.groupby(['Comp_ID','item_number'])
g['consecutive'].apply(lambda x: x.idxmin() - x.idxmax() )

This gets the desired result for the smaller data-set:

Comp_ID  item_number
1395     G123H          2
         KIN005         3
7787     KIN005         2
Name: consecutive, dtype: int64
like image 928
Yale Newman Avatar asked Nov 25 '16 18:11

Yale Newman


People also ask

How do you count after GroupBy in pandas?

Use count() by Column Name Use pandas DataFrame. groupby() to group the rows by column and use count() method to get the count for each group by ignoring None and Nan values. It works with non-floating type data as well.

How do you check for consecutive days in Python?

In this, we check consecutive dates by checking days difference from the previous date using days(). The iteration of all dates is done using a loop.

How do you count the number of occurrences in pandas?

Using the size() or count() method with pandas. DataFrame. groupby() will generate the count of a number of occurrences of data present in a particular column of the dataframe.


1 Answers

You can do it this way:

today = pd.to_datetime('2016-11-22')

# sort DF by `date` (descending)    
x = df.sort_values('date', ascending=0)
g = x.groupby(['Comp_ID','item_number'])
# compare the # of days to `today` with a consecutive day# in each group
x[(today - x['date']).dt.days == g.cumcount()].groupby(['Comp_ID','item_number']).size()

Result:

Comp_ID  item_number
1395     G123H          2
         KIN005         3
7787     A789B          2
dtype: int64

PS thanks to @DataSwede's for faster diff calculation!

Explanation:

In [124]: x[(today - x['date']).dt.days == g.cumcount()] \
           .sort_values(['Comp_ID','item_number','date'], ascending=[1,1,0])
Out[124]:
  Comp_ID       date item_number
8    1395 2016-11-22       G123H
9    1395 2016-11-21       G123H
0    1395 2016-11-22      KIN005
1    1395 2016-11-21      KIN005
2    1395 2016-11-20      KIN005
5    7787 2016-11-22       A789B
6    7787 2016-11-21       A789B
like image 84
MaxU - stop WAR against UA Avatar answered Oct 30 '22 18:10

MaxU - stop WAR against UA