I'm working with two different datasets that I want to merge based on a threshold. Let's say the two dataframes look like this:
library(dplyr)
library(fuzzyjoin)
library(lubridate)
df1 = data_frame(Item=1:5,
                 DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23",
                            "2015-01-02 15:46:11", "2015-04-19 22:11:33",
                            "2015-06-10 07:00:00"),
                 Count=c(1, 6, 11, 15, 9),
                 Name="Sterling",
                 Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
df1$DateTime = ymd_hms(df1$DateTime)
df2 = data_frame(Item=21:25,
                 DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:15:23",
                            "2015-01-02 15:46:11", "2015-05-19 22:11:33",
                            "2015-06-10 07:00:02"),
                 Count=c(3, 7, 11, 15, 8),
                 Name="Sterling",
                 Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime = ymd_hms(df2$DateTime)
What I would like now is to be able to left join df2 with df1 based on a fuzzy match of DateTime and Count, each being within two of their respective values (two seconds in the case of DateTime), while all other values except Item are identical. I thought I could get there with the following:
df1 %>%
  difference_left_join(df2, by=c("DateTime", "Count"), max_dist=2)
But that gives me the following output:
# A tibble: 8 × 10
Item.x DateTime.x Count.x Name.x Friend.x Item.y DateTime.y Count.y Name.y Friend.y
<int> <dttm> <dbl> <chr> <chr> <int> <dttm> <dbl> <chr> <chr>
1 1 2015-01-01 11:12:14 1 Sterling Pam 21 2015-01-01 11:12:15 3 Sterling Pam
2 1 2015-01-01 11:12:14 1 Sterling Pam 21 2015-01-01 11:12:15 3 Sterling Pam
3 2 2015-01-02 09:15:23 6 Sterling Cyril NA <NA> NA <NA> <NA>
4 3 2015-01-02 15:46:11 11 Sterling Cheryl 23 2015-01-02 15:46:11 11 Sterling Woodhouse
5 3 2015-01-02 15:46:11 11 Sterling Cheryl 23 2015-01-02 15:46:11 11 Sterling Woodhouse
6 4 2015-04-19 22:11:33 15 Sterling Mallory NA <NA> NA <NA> <NA>
7 5 2015-06-10 07:00:00 9 Sterling Lana 25 2015-06-10 07:00:02 8 Sterling Lana
8 5 2015-06-10 07:00:00 9 Sterling Lana 25 2015-06-10 07:00:02 8 Sterling Lana
This is close, except that the rows for Item 3 should not have merged given that the Friend values are different (and I would have expected the row for Item 2 to merge given the thresholds, even though I don't want it to).
How do I end up with the following dataframe? Note that the second and third rows from df2 were not merged despite DateTime and Count meeting the threshold limit. This is because the other columns (except Item) were not identical.
desired_output
# Item DateTime Count Name Friend
# 1 3 2015-01-02 15:46:11 11 Sterling Cheryl
# 2 2 2015-01-02 09:15:23 6 Sterling Cyril
# 3 5 2015-06-10 07:00:00 9 Sterling Lana
# 4 25 2015-06-10 07:00:02 8 Sterling Lana
# 5 4 2015-04-19 22:11:33 15 Sterling Mallory
# 6 1 2015-01-01 11:12:14 1 Sterling Pam
# 7 21 2015-01-01 11:12:15 3 Sterling Pam
OK, so, the message you got was because the fuzzy match can't be calculated on a non-numeric column. The thing to do is to convert it to numeric. Since your caliper is in seconds, I converted DateTime to seconds and then made it numeric:
library(dplyr)
library(fuzzyjoin)
library(lubridate)
df1 = data_frame(Item=1:5,
                 DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23",
                            "2015-01-02 15:46:11", "2015-04-19 22:11:33",
                            "2015-06-10 07:00:00"),
                 Count=c(1, 6, 11, 15, 9),
                 Name="Sterling",
                 Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
df1$DateTime1 = as.numeric(seconds(ymd_hms(df1$DateTime)))
df2 = data_frame(Item=21:25,
                 DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:25:56",
                            "2015-01-02 15:46:11", "2015-05-19 22:11:33",
                            "2015-06-10 07:00:02"),
                 Count=c(3, 6, 11, 15, 8),
                 Name="Sterling",
                 Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime1 = as.numeric(seconds(ymd_hms(df2$DateTime)))
df1 %>%
  difference_left_join(y=df2, by=c("DateTime1", "Count"), max_dist=2)
Based on our discussion in the comments, a simple tweak to subset this to the case where the other character columns match would be:
df1[df2$Friend == df1$Friend,] %>%
  difference_left_join(y=df2[df2$Friend == df1$Friend,], by=c("DateTime1", "Count"), max_dist=2)
That example is just for Friend, but of course you could use & to do it with multiple columns, as sketched below.
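For instance, a minimal sketch of that idea, assuming you also want Name to agree (the helper name same_other_cols is just illustrative, and like the row-wise Friend comparison above it assumes df1 and df2 have the same number of rows in corresponding order):
same_other_cols = df1$Friend == df2$Friend & df1$Name == df2$Name

df1[same_other_cols,] %>%
  difference_left_join(y=df2[same_other_cols,], by=c("DateTime1", "Count"), max_dist=2)
If you'd rather avoid the positional assumption, another option is to run the difference_left_join on all rows and then filter the result on the suffixed columns (e.g. Friend.x == Friend.y & Name.x == Name.y), keeping only pairs whose non-Item columns are identical.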