Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging two Data frames with fuzzy merge/sqldf

Tags:

r

I have the following dataframes (df11 and df22) I'd like to do a merge/full join on with "UserID=UserID" and date difference <= 30 . So if the UserIDs match up AND the date's are less than or equal to 30, I'd like them merged into one singular row. I've looked at fuzzy join here and sqldf here but I can't figure out how to implement either of those for my data frames.

df1 <- structure(list(UserID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L), 
                      Full.Name = c( "John Smith", "Jack Peters", "Bob Brown", "Jane Doe", "Jackie Jane", "Sarah Brown", "Chloe Brown", "John Smith" ), 
                      Info = c("yes", "no", "yes", "yes", "yes", "yes", "no", "yes"), 
                      EncounterID = c(13L, 14L, 15L, 16L, 17L, 18L, 19L, 13L), DateTime = c("1/2/21 00:00", "1/5/21 12:00", "1/1/21 1:31", "1/5/21 3:34", "5/9/21 5:33", "5/8/21 3:39", "12/12/21 2:30", "12/11/21 9:21"), 
                      Temp = c("100", "103", "104", "103", "101", "102", "103", "105"), 
 
                      misc = c("(null)", "no", "(null)", "(null)", "(null)","(null)", "(null)", "(null)" 
                                    )), 
                 class = "data.frame", row.names = c(NA, 
                                                     -8L))

df2 <- structure(list(UserID = c(1L, 2L, 3L, 4L, 5L, 6L), 
                      Full.Name = c("John Smith", "Jack Peters", "Bob Brown", "Jane Doe", "Jackie Jane", "Sarah Brown"), 
                      DOB = c("1/1/90", "1/10/90", "1/2/90", "2/20/80", "2/2/80", "12/2/80"), 
                      EncounterID = c(13L, 14L, 15L, 16L, 17L, 18L), EncounterDate = c("1/1/21", "1/2/21", "1/1/21", "1/6/21", "5/7/21", "5/8/21"), 
                      Type = c("Intro", "Intro", "Intro", "Intro", "Care", "Out"), 
                      responses = c("(null)", "no", 
                                    "yes", "no", "no", "unsat")), 
                      
                 class = "data.frame", row.names = c(NA, 
                                                     -6L))
loadedNamespaces()
install.packages("Rcpp")
library(dplyr)
library(tidyr)
install.packages("lubridate")
library(lubridate)

df11 <- 
df1 %>% 
  separate(DateTime, c("Date", "Time"), sep=" ") %>% 
  mutate(Date = as_datetime(mdy(Date))) %>% 
  select(-Time) %>% 
  as_tibble()

df22 <-
df2 %>% 
  mutate(across(c(EncounterDate), mdy)) %>% 
  mutate(across(c(EncounterDate), as_datetime)) %>% 
  as_tibble()

@r2evans After running the first set of code, I get the following output. Which is slightly different from yours.

df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))
df11
# A tibble: 8 x 7
  UserID Full.Name   Info  EncounterID Date                Temp  misc  
   <int> <chr>       <chr>       <int> <dttm>              <chr> <chr> 
1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null)
2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no    
3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null)
4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null)
5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null)
6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null)
7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null)
8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null)
like image 201
Ash S Avatar asked Jan 24 '26 02:01

Ash S


1 Answers

One way is to first create "+/- 30 day" columns in one of them, then do a standard date-range join. Using sqldf:

Prep:

library(dplyr)
df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))
df11
# # A tibble: 8 x 9
#   UserID Full.Name   Info  EncounterID Date                Temp  misc   Date_m30            Date_p30           
#    <int> <chr>       <chr>       <int> <dttm>              <chr> <chr>  <dttm>              <dttm>             
# 1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null) 2020-12-03 00:00:00 2021-02-01 00:00:00
# 2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no     2020-12-06 00:00:00 2021-02-04 00:00:00
# 3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null) 2020-12-02 00:00:00 2021-01-31 00:00:00
# 4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null) 2020-12-06 00:00:00 2021-02-04 00:00:00
# 5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null) 2021-04-09 00:00:00 2021-06-08 00:00:00
# 6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null) 2021-04-08 00:00:00 2021-06-07 00:00:00
# 7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null) 2021-11-12 00:00:00 2022-01-11 00:00:00
# 8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null) 2021-11-11 00:00:00 2022-01-10 00:00:00

The join:

sqldf::sqldf("
    select df11.*, df22.DOB, df22.EncounterDate, df22.Type, df22.responses
    from df11
      left join df22 on df11.UserID = df22.UserID
        and df22.EncounterDate between df11.Date_m30 and df11.Date_p30") %>%
  select(-Date_m30, -Date_p30)
#   UserID   Full.Name Info EncounterID                Date Temp   misc     DOB       EncounterDate  Type responses
# 1      1  John Smith  yes          13 2021-01-01 19:00:00  100 (null)  1/1/90 2020-12-31 19:00:00 Intro    (null)
# 2      2 Jack Peters   no          14 2021-01-04 19:00:00  103     no 1/10/90 2021-01-01 19:00:00 Intro        no
# 3      3   Bob Brown  yes          15 2020-12-31 19:00:00  104 (null)  1/2/90 2020-12-31 19:00:00 Intro       yes
# 4      4    Jane Doe  yes          16 2021-01-04 19:00:00  103 (null) 2/20/80 2021-01-05 19:00:00 Intro        no
# 5      5 Jackie Jane  yes          17 2021-05-08 20:00:00  101 (null)  2/2/80 2021-05-06 20:00:00  Care        no
# 6      6 Sarah Brown  yes          18 2021-05-07 20:00:00  102 (null) 12/2/80 2021-05-07 20:00:00   Out     unsat
# 7      7 Chloe Brown   no          19 2021-12-11 19:00:00  103 (null)    <NA>                <NA>  <NA>      <NA>
# 8      1  John Smith  yes          13 2021-12-10 19:00:00  105 (null)    <NA>                <NA>  <NA>      <NA>
like image 82
r2evans Avatar answered Jan 26 '26 18:01

r2evans