
Python Pandas Dataframe GroupBy Size based on condition

I have a dataframe 'df' that looks like this:

id  date1   date2
1   11/1/2016   11/1/2016
1   11/1/2016   11/2/2016
1   11/1/2016   11/1/2016
1   11/1/2016   11/2/2016
1   11/2/2016   11/2/2016
2   11/1/2016   11/1/2016
2   11/1/2016   11/2/2016
2   11/1/2016   11/1/2016
2   11/2/2016   11/2/2016
2   11/2/2016   11/2/2016

What I would like to do is to groupby the id, then get the size for each id where date1=date2. The result should look like:

id  samedate    count
1   11/1/2016    2 
1   11/2/2016    1 
2   11/1/2016    2 
2   11/2/2016    2 

I have tried this:

gb = df.groupby('id').apply(lambda x: x[x.date1 == x.date2]['date1'].size())

And get this error:

TypeError: 'int' object is not callable

You could certainly flag each row where date1 and date2 are equal, then count those flags for each id by each samedate, but I have to believe there is a groupby option for this.
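For reference, the TypeError comes from `.size()`: `Series.size` is an integer attribute, not a method, so appending `()` tries to call an int. The flag-and-count approach described above can be sketched on the sample data like this:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date1': ['11/1/2016', '11/1/2016', '11/1/2016', '11/1/2016', '11/2/2016',
              '11/1/2016', '11/1/2016', '11/1/2016', '11/2/2016', '11/2/2016'],
    'date2': ['11/1/2016', '11/2/2016', '11/1/2016', '11/2/2016', '11/2/2016',
              '11/1/2016', '11/2/2016', '11/1/2016', '11/2/2016', '11/2/2016'],
})

# Flag rows where the two dates match, then sum the flags per (id, date1)
df['same'] = (df['date1'] == df['date2']).astype(int)
counts = df.groupby(['id', 'date1'])['same'].sum().reset_index(name='count')
print(counts)
```

This yields the desired per-(id, samedate) counts, at the cost of an extra helper column.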

asked Nov 27 '16 by clg4


1 Answer

You can use boolean indexing first and then aggregate size:

df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)

df = df[df.date1 == df.date2]
gb = df.groupby(['id', 'date1']).size().reset_index(name='count')
print(gb)
   id      date1  count
0   1 2016-11-01      2
1   1 2016-11-02      1
2   2 2016-11-01      2
3   2 2016-11-02      2

Timings:

In [79]: %timeit (df[df.date1 == df.date2].groupby(['id', 'date1']).size().reset_index(name='count'))
100 loops, best of 3: 3.84 ms per loop

In [80]: %timeit (df.groupby(['id', 'date1']).apply(lambda x: (x['date1'] == x['date2']).sum()).reset_index())
100 loops, best of 3: 7.57 ms per loop
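The slower `apply` variant in the second timing can also be written as a vectorized boolean sum. A minimal sketch on the same sample data; note that unlike filtering first, this keeps (id, date1) groups whose count is 0:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date1': pd.to_datetime(['11/1/2016', '11/1/2016', '11/1/2016', '11/1/2016',
                             '11/2/2016', '11/1/2016', '11/1/2016', '11/1/2016',
                             '11/2/2016', '11/2/2016']),
    'date2': pd.to_datetime(['11/1/2016', '11/2/2016', '11/1/2016', '11/2/2016',
                             '11/2/2016', '11/1/2016', '11/2/2016', '11/1/2016',
                             '11/2/2016', '11/2/2016']),
})

# Sum the boolean comparison per group instead of filtering rows first
gb = (df.assign(match=df.date1.eq(df.date2))
        .groupby(['id', 'date1'], as_index=False)['match'].sum()
        .rename(columns={'match': 'count'}))
print(gb)
```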

Code for timings:

# build a 10k-row frame by repeating the 10-row sample 1000 times
df = pd.concat([df] * 1000).reset_index(drop=True)

df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)
answered Sep 18 '22 by jezrael