
Count Number of Rows Between Two Dates BY ID in a Pandas GroupBy Dataframe

I have the following test DataFrame:

import random
from datetime import timedelta
import pandas as pd
import datetime

#create test range of dates
rng = pd.date_range(datetime.date(2015, 1, 1), datetime.date(2015, 7, 31))
rnglist = rng.tolist()
testpts = range(100, 121)
#create test dataframe (randint's upper bound is inclusive, so index with len(rng)-1)
d = {'jid': [i for i in range(100, 121)],
     'cid': [random.randint(1, 2) for _ in testpts],
     'stdt': [rnglist[random.randint(0, len(rng) - 1)] for _ in testpts]}
df = pd.DataFrame(d)
#a single random offset (2-32 days) is added to every row's stdt
df['enddt'] = df['stdt'] + timedelta(days=random.randint(2, 32))

This gives a DataFrame like the one below, with a company id column 'cid', a unique id column 'jid', a start date column 'stdt', and an end date column 'enddt':

   cid  jid       stdt      enddt
0    1  100 2015-07-06 2015-07-13
1    1  101 2015-07-15 2015-07-22
2    2  102 2015-07-12 2015-07-19
3    2  103 2015-07-07 2015-07-14
4    2  104 2015-07-14 2015-07-21
5    1  105 2015-07-11 2015-07-18
6    1  106 2015-07-12 2015-07-19
7    2  107 2015-07-01 2015-07-08
8    2  108 2015-07-10 2015-07-17
9    2  109 2015-07-09 2015-07-16

What I need to do is the following: for each cid, count the number of jid active on each date (newdate) between the min(stdt) and max(enddt), where a jid counts on a given newdate if that newdate falls between its stdt and enddt.

The result should be a DataFrame that has, for each cid, a column of dates (newdate) running from that cid's min(stdt) to its max(enddt), and a count (cnt) of the number of jid whose stdt and enddt bracket each newdate. The resulting DataFrame should look like this (shown for just one cid using the data above):

cid newdate cnt
1   2015-07-06  1
1   2015-07-07  1
1   2015-07-08  1
1   2015-07-09  1
1   2015-07-10  1
1   2015-07-11  2
1   2015-07-12  3
1   2015-07-13  3
1   2015-07-14  2
1   2015-07-15  3
1   2015-07-16  3
1   2015-07-17  3
1   2015-07-18  3
1   2015-07-19  2
1   2015-07-20  1
1   2015-07-21  1
1   2015-07-22  1

I believe there should be a way to use pandas groupby (grouping by cid), perhaps with some form of lambda, to create this new DataFrame pythonically.

I currently run a loop: for each cid I slice that cid's rows out of the master df, determine the relevant date range (min stdt to max enddt for that cid slice), and then for each newdate in that range count the number of jid whose stdt and enddt bracket the newdate. I then append each resulting dataset to a new DataFrame that looks like the one above.

But this is very expensive from a resource and time perspective. Doing this on millions of jid for thousands of cid literally takes a full day. I am hoping there is a simple(r) pandas solution here.
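Roughly, the current approach looks like this (a minimal sketch of the loop just described, using the test DataFrame above; the variable names are only illustrative):

parts = []
for cid in df['cid'].unique():
    sub = df[df['cid'] == cid]  # slice this cid's rows out of the master df
    dates = pd.date_range(sub['stdt'].min(), sub['enddt'].max())
    # count jids whose stdt/enddt bracket each date; one scan of sub per date
    cnts = [((sub['stdt'] <= d) & (sub['enddt'] >= d)).sum() for d in dates]
    parts.append(pd.DataFrame({'cid': cid, 'newdate': dates, 'cnt': cnts}))
result = pd.concat(parts, ignore_index=True)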

asked Aug 02 '15 by clg4


2 Answers

My usual approach for these problems is to pivot and think in terms of events changing an accumulator. Every "stdt" we see adds +1 to the count; every "enddt" we see adds -1. (Strictly, the -1 applies the next day, at least if I'm interpreting "between" the way you are. Some days I think we should ban that word as too ambiguous...)

IOW, if we turn your frame to something like

>>> df.head()
    cid  jid  change       date
0     1  100       1 2015-01-06
1     1  101       1 2015-01-07
21    1  100      -1 2015-01-16
22    1  101      -1 2015-01-17
17    1  117       1 2015-03-01

then what we want is simply the cumulative sum of change (after suitable regrouping). For example, something like

df["enddt"] += timedelta(days=1)
df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
df["change"] = df["change"].replace({"stdt": 1, "enddt": -1})
df = df.sort(["cid", "date"])

df = df.groupby(["cid", "date"],as_index=False)["change"].sum()
df["count"] = df.groupby("cid")["change"].cumsum()

new_time = pd.date_range(df.date.min(), df.date.max())

df_parts = []
for cid, group in df.groupby("cid"):
    full_count = group[["date", "count"]].set_index("date")
    full_count = full_count.reindex(new_time)   # add the dates with no events
    full_count = full_count.ffill().fillna(0)   # carry counts forward; 0 before the first event
    full_count["cid"] = cid
    df_parts.append(full_count)

df_new = pd.concat(df_parts)

which gives me something like

>>> df_new.head(15)
            count  cid
2015-01-03      0    1
2015-01-04      0    1
2015-01-05      0    1
2015-01-06      1    1
2015-01-07      2    1
2015-01-08      2    1
2015-01-09      2    1
2015-01-10      2    1
2015-01-11      2    1
2015-01-12      2    1
2015-01-13      2    1
2015-01-14      2    1
2015-01-15      2    1
2015-01-16      1    1
2015-01-17      0    1

There may be off-by-one differences with regard to your expectations, and you may have different ideas about how to handle multiple overlapping jids in the same time window (here they would count as 2), but the basic idea of working with events should prove useful even if you have to tweak the details.
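If you want the result in the exact (cid, newdate, cnt) shape from the question, a small post-processing step on df_new could look like this (just a sketch, assuming the code above ran as written; the column names follow the question):

out = (df_new.rename_axis("newdate")          # name the date index
             .reset_index()                   # ...and turn it into a column
             .rename(columns={"count": "cnt"})[["cid", "newdate", "cnt"]])
out = out[out["cnt"] > 0]                     # optionally drop zero-count days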

answered by DSM

Here is a solution I came up with (it loops over the Cartesian product of the unique cids and the full date range to compute the counts):

from itertools import product

df_new_date = pd.DataFrame(list(product(df.cid.unique(), pd.date_range(df.stdt.min(), df.enddt.max()))),
                           columns=['cid', 'newdate'])
df_new_date['cnt'] = df_new_date.apply(lambda row: df[(df['cid'] == row['cid']) & (df['stdt'] <= row['newdate']) &
                                                      (df['enddt'] >= row['newdate'])]['jid'].count(), axis=1)

>>> df_new_date.head(20) 
    cid    newdate  cnt
0     1 2015-07-01    0
1     1 2015-07-02    0
2     1 2015-07-03    0
3     1 2015-07-04    0
4     1 2015-07-05    0
5     1 2015-07-06    1
6     1 2015-07-07    1
7     1 2015-07-08    1
8     1 2015-07-09    1
9     1 2015-07-10    1
10    1 2015-07-11    2
11    1 2015-07-12    3
12    1 2015-07-13    3
13    1 2015-07-14    2
14    1 2015-07-15    3
15    1 2015-07-16    3
16    1 2015-07-17    3
17    1 2015-07-18    3
18    1 2015-07-19    2
19    1 2015-07-20    1

You could then drop the zeros if you don't want them. I don't think this will be much better than your original solution, however.
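For example, dropping the zero rows is a simple boolean filter on the frame built above:

df_new_date = df_new_date[df_new_date['cnt'] > 0]  # keep only dates with at least one active jid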

I would also suggest the following improvement to the loop in @DSM's solution:

df_parts = []
for cid in df.cid.unique():
    # fill in the missing dates for this cid, forward-filling the running count
    full_count = (df[df.cid == cid][['cid', 'date', 'count']]
                  .set_index("date")
                  .asfreq("D", method='ffill')[['cid', 'count']]
                  .reset_index())
    df_parts.append(full_count[full_count['count'] != 0])

df_new = pd.concat(df_parts)

>>> df_new
         date  cid  count
0  2015-07-06    1      1
1  2015-07-07    1      1
2  2015-07-08    1      1
3  2015-07-09    1      1
4  2015-07-10    1      1
5  2015-07-11    1      2
6  2015-07-12    1      3
7  2015-07-13    1      3
8  2015-07-14    1      2
9  2015-07-15    1      3
10 2015-07-16    1      3
11 2015-07-17    1      3
12 2015-07-18    1      3
13 2015-07-19    1      2
14 2015-07-20    1      1
15 2015-07-21    1      1
16 2015-07-22    1      1
0  2015-07-01    2      1
1  2015-07-02    2      1
2  2015-07-03    2      1
3  2015-07-04    2      1
4  2015-07-05    2      1
5  2015-07-06    2      1
6  2015-07-07    2      2
7  2015-07-08    2      2
8  2015-07-09    2      2
9  2015-07-10    2      3
10 2015-07-11    2      3
11 2015-07-12    2      4
12 2015-07-13    2      4
13 2015-07-14    2      5
14 2015-07-15    2      4
15 2015-07-16    2      4
16 2015-07-17    2      3
17 2015-07-18    2      2
18 2015-07-19    2      2
19 2015-07-20    2      1
20 2015-07-21    2      1

The only real improvement over what @DSM provided is that this avoids creating a groupby object inside the loop, and it also gives you, for each cid, only the dates from its min stdt to its max enddt, with no zero-count rows.

answered by khammel