How to join data frames based on condition between 2 columns

Tags: merge, dataframe, r

I am stuck with a project where I need to merge two data frames. They look something like this:

Data1
Traffic Source    Registrations    Hour    Minute
organic           1                6        13
social            1                8        54

Data2
Email                     Hour2   Minute2
[email protected]           6         13
[email protected]         8         55

I have the following line of code to merge the 2 data frames:

merge.df <- merge(Data1, Data2, by.x = c( "Hour", "Minute"),
           by.y = c( "Hour2", "Minute2"))

It would work great if the times (hours and minutes) weren't slightly off between the two data sets. Is there a way to make the column "Minute" match "Minute2" when it's one minute off in either direction?

I thought I could create 2 new columns for data set one:

Data1
Traffic Source    Registrations   Hour   Minute    Minute_plus1   Minute_minus1
organic           1               6        13      14              12
social            1               8        54      55              53

Is it possible to merge the 2 data frames if "Minute2" matches any of "Minute", "Minute_plus1", or "Minute_minus1"? Or is there a more efficient way to accomplish this merge?

asked Apr 28 '15 18:04 by heyydrien


1 Answer

For stuff like this I usually turn to SQL:

library(sqldf)
x = sqldf("
  SELECT *
  FROM Data1 d1 JOIN Data2 d2
  ON d1.Hour = d2.Hour2
  AND ABS(d1.Minute - d2.Minute2) <= 1
")

Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:

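library(dplyr)
# Pair up every row that shares an hour, then keep pairs within one minute.
# filter() drops unmatched (NA) rows, so the left join acts like an inner join here.
x = Data1 %>%
  left_join(Data2, by = c("Hour" = "Hour2")) %>%
  filter(abs(Minute - Minute2) <= 1)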

though you could do the same thing with base functions.
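
For completeness, here is a minimal sketch of what the base-R version might look like, using the same join-on-hour-then-filter idea. The data frames are rebuilt from the question's sample rows; the email addresses are placeholders (the originals are obscured), and the column name `Traffic.Source` is an assumption about how the "Traffic Source" column is stored.

```r
# Sample data from the question. Emails are placeholders, and
# Traffic.Source is an assumed name for the "Traffic Source" column.
Data1 <- data.frame(
  Traffic.Source = c("organic", "social"),
  Registrations  = c(1, 1),
  Hour           = c(6, 8),
  Minute         = c(13, 54)
)
Data2 <- data.frame(
  Email   = c("user1@example.com", "user2@example.com"),
  Hour2   = c(6, 8),
  Minute2 = c(13, 55)
)

# Join on the hour only; this pairs every Data1 row with every
# Data2 row that falls in the same hour.
joined <- merge(Data1, Data2, by.x = "Hour", by.y = "Hour2")

# Keep only the pairs whose minutes differ by at most one.
x <- joined[abs(joined$Minute - joined$Minute2) <= 1, ]
```

Like the two versions above, this matches on the hour exactly, so times that straddle an hour boundary (6:59 and 7:00, say) would not be paired.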

answered Oct 17 '22 18:10 by Gregor Thomas