Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas merge_asof on multiple columns

Tags:

python

pandas

I have two dataframes:

DF1:

StartDate      Location

2013-01-01     20000002
2013-03-01     20000002
2013-08-01     20000002
2013-01-01     20000003
2013-03-01     20000003
2013-05-01     20000003
2013-01-01     20000043

DF2:

EmpStartDate   Location

2012-12-17     20000002.0 
2013-02-25     20000002.0 
2013-06-26     20000002.0 
2012-09-24     20000003.0 
2013-01-07     20000003.0 
2013-07-01     20000043.0

I want the count from DF2 where DF1.Location = DF2.Location and DF2.EmpStartDate<=DF1.StartDate

Output:

StartDate      Location   Count

2013-01-01     20000002   1
2013-03-01     20000002   2
2013-08-01     20000002   3
2013-01-01     20000003   1
2013-03-01     20000003   2
2013-05-01     20000003   2
2013-01-01     20000043   0

I am using merge_asof on DF2.EmpStartDate and DF1.StartDate and then doing a group by on Location and StartDate to achieve this. But I am getting incorrect results as I am merging only on the date columns. I need to merge the dataframes on both Location and Date columns. Looks like merge_asof doesnt support merging on multiple columns. How do I merge the date columns on different groups of location?

like image 600
sks Avatar asked Nov 05 '18 15:11

sks


2 Answers

merge_asof maintains the size of the left DataFrame, so it cannot match the same row in left to multiple rows in right.

A simple, but perhaps memory inefficient way to calculate this is to perform one big merge on Location and then count how many rows have df.EmpStartDate < df.StartDate

df = df1.merge(df2)
(df.assign(Count = df.EmpStartDate < df.StartDate)
   .groupby(['StartDate', 'Location'])
   .Count.sum()
   .astype('int')
   .reset_index())

Output:

   StartDate  Location  Count
0 2013-01-01  20000002      1
1 2013-01-01  20000003      1
2 2013-01-01  20000043      0
3 2013-03-01  20000002      2
4 2013-03-01  20000003      2
5 2013-05-01  20000003      2
6 2013-08-01  20000002      3
like image 144
ALollz Avatar answered Nov 17 '22 10:11

ALollz


Let use this:

df1.merge(df2, on='Location')\
   .query('EmpStartDate <= StartDate')\
   .groupby(['StartDate','Location'])['EmpStartDate']\
   .count()\
   .reindex(df1, fill_value=0)\
   .rename('Count')\
   .reset_index()

Output:

   StartDate  Location  Count
0 2013-01-01  20000002      1
1 2013-03-01  20000002      2
2 2013-08-01  20000002      3
3 2013-01-01  20000003      1
4 2013-03-01  20000003      2
5 2013-05-01  20000003      2
6 2013-01-01  20000043      0
like image 37
Scott Boston Avatar answered Nov 17 '22 09:11

Scott Boston