I have a dataframe 'df' that looks like this:
id date1 date2
1 11/1/2016 11/1/2016
1 11/1/2016 11/2/2016
1 11/1/2016 11/1/2016
1 11/1/2016 11/2/2016
1 11/2/2016 11/2/2016
2 11/1/2016 11/1/2016
2 11/1/2016 11/2/2016
2 11/1/2016 11/1/2016
2 11/2/2016 11/2/2016
2 11/2/2016 11/2/2016
What I would like to do is group by id and then count, for each id and date, the rows where date1 == date2. The result should look like:
id samedate count
1 11/1/2016 2
1 11/2/2016 1
2 11/1/2016 2
2 11/2/2016 2
I have tried this:
gb=df.groupby('id').apply(lambda x: x[x.date1== x.date2]['date1'].size())
And get this error:
TypeError: 'int' object is not callable
You could certainly flag each instance where the date1 and date2 are equal, then count those flags for each id by each samedate, but I have to believe there is a groupby option for this.
You can use boolean indexing first and then aggregate with size:
df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)
df = df[df.date1 == df.date2]
gb=df.groupby(['id', 'date1']).size().reset_index(name='count')
print (gb)
   id      date1  count
0   1 2016-11-01      2
1   1 2016-11-02      1
2   2 2016-11-01      2
3   2 2016-11-02      2
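For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same filter-then-size approach. The names same and per_id and the rename to samedate are my own additions to match the requested output. It also shows what was probably behind the TypeError: Series.size is an attribute holding an int, not a method, so calling it with parentheses fails.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'date1': ['11/1/2016', '11/1/2016', '11/1/2016', '11/1/2016', '11/2/2016',
              '11/1/2016', '11/1/2016', '11/1/2016', '11/2/2016', '11/2/2016'],
    'date2': ['11/1/2016', '11/2/2016', '11/1/2016', '11/2/2016', '11/2/2016',
              '11/1/2016', '11/2/2016', '11/1/2016', '11/2/2016', '11/2/2016'],
})
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# Keep only rows where the two dates match, then count rows per (id, date1).
same = df[df['date1'] == df['date2']]
gb = (same.groupby(['id', 'date1'])
          .size()
          .reset_index(name='count')
          .rename(columns={'date1': 'samedate'}))  # column name from the question
print(gb)

# The original attempt runs once .size is used as an attribute (no parentheses).
# Note that it counts matching rows per id only, not per id and date.
per_id = df.groupby('id').apply(
    lambda x: x.loc[x['date1'] == x['date2'], 'date1'].size
)
print(per_id)
```

The first result has one row per id and matching date, which is the shape asked for in the question. The second gives a single total per id.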
Timings:
In [79]: %timeit (df[df.date1 == df.date2].groupby(['id', 'date1']).size().reset_index(name='count'))
100 loops, best of 3: 3.84 ms per loop
In [80]: %timeit (df.groupby(['id', 'date1']).apply(lambda x: (x['date1'] == x['date2']).sum()).reset_index())
100 loops, best of 3: 7.57 ms per loop
Code for timings:
#len df = 10k
df = pd.concat([df]*1000).reset_index(drop=True)
#print (df)
df.date1 = pd.to_datetime(df.date1)
df.date2 = pd.to_datetime(df.date2)
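The "flag and count" idea from the question can also be written without apply, by summing a boolean mask per group. This is only a sketch on a small made-up frame (not the question's data), and the names flag and counts are mine. One difference from filtering first: groups with no matching rows are kept and show a count of 0, which may or may not be what you want.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, chosen so one group has no match).
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'date1': pd.to_datetime(['11/1/2016', '11/1/2016', '11/2/2016',
                             '11/1/2016', '11/2/2016']),
    'date2': pd.to_datetime(['11/1/2016', '11/2/2016', '11/3/2016',
                             '11/1/2016', '11/2/2016']),
})

# True where the two dates are equal.
flag = df['date1'] == df['date2']

# Summing booleans counts the True values within each (id, date1) group.
counts = flag.groupby([df['id'], df['date1']]).sum().reset_index(name='count')
print(counts)
```

Here the group (id 1, 2016-11-02) has no matching rows, so it appears with a count of 0 instead of being dropped.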