I am stuck with a project where I need to merge two data frames. They look something like this:
Data1
    Traffic Source  Registrations  Hour  Minute
    organic         1              6     13
    social          1              8     54
Data2
    Email             Hour2  Minute2
    [email protected]  6      13
    [email protected]  8      55
I have the following line of code to merge the two data frames:
    merge.df <- merge(Data1, Data2,
                      by.x = c("Hour", "Minute"),
                      by.y = c("Hour2", "Minute2"))
This would work great if the times (hours and minutes) weren't slightly off between the two data sets. Is there a way to make the column "Minute" match "Minute2" when they differ by at most one minute in either direction?
I thought I could create two new columns in Data1:
Data1
    Traffic Source  Registrations  Hour  Minute  Minute_plus1  Minute_minus1
    organic         1              6     13      14            12
    social          1              8     54      55            53
Is it possible to merge the two data frames if "Minute2" matches any one of "Minute", "Minute_plus1", or "Minute_minus1"? Or is there a more efficient way to accomplish this merge?
For stuff like this I usually turn to SQL:
    library(sqldf)
    # Join on the hour, allowing the minute to differ by at most one
    x = sqldf("
      SELECT *
      FROM Data1 d1 JOIN Data2 d2
        ON d1.Hour = d2.Hour2
       AND ABS(d1.Minute - d2.Minute2) <= 1
    ")
Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:
    library(dplyr)
    # Join on the hour only, then keep rows whose minutes are within one of each other
    x = Data1 %>%
      left_join(Data2, by = c("Hour" = "Hour2")) %>%
      filter(abs(Minute - Minute2) <= 1)
though you could do the same thing with base functions.
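As a rough illustration of the base-functions route, here is a minimal sketch that uses merge() on the hour and then filters on the minute difference. The sample data mirrors the question; the e-mail addresses are made-up placeholders (an assumption), since the real ones are obscured in the post.

```r
# Sample data mirroring the question. The e-mail values are placeholders
# (assumption): the originals are not readable in the post.
Data1 <- data.frame(
  `Traffic Source` = c("organic", "social"),
  Registrations    = c(1, 1),
  Hour             = c(6, 8),
  Minute           = c(13, 54),
  check.names      = FALSE  # keep the space in "Traffic Source"
)
Data2 <- data.frame(
  Email   = c("user1@example.com", "user2@example.com"),
  Hour2   = c(6, 8),
  Minute2 = c(13, 55)
)

# Step 1: inner join on the hour only, so each Data1 row is paired with
# every Data2 row that shares its hour.
joined <- merge(Data1, Data2, by.x = "Hour", by.y = "Hour2")

# Step 2: keep only the pairs whose minutes differ by at most one.
x <- joined[abs(joined$Minute - joined$Minute2) <= 1, ]
rownames(x) <- NULL
print(x)
```

Like the sqldf and dplyr versions, this only compares minutes within the same hour, so a pair such as 6:59 and 7:00 would not match. If that case matters, one option would be to compare Hour * 60 + Minute instead of joining on Hour.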