How to make a fuzzy join in R using more than one variable on each side

Q: How do you make a fuzzy join in R?

Often you may want to join together two datasets in R based on imperfectly matching strings. This is sometimes called fuzzy matching. The easiest way to perform fuzzy matching in R is to use the stringdist_join() function from the fuzzyjoin package.
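As a minimal sketch of that approach (the data frames `fruit` and `prices` are made up for illustration and are not from the question), stringdist_left_join() can pair misspelled names as long as they lie within a chosen edit distance:

```r
library(fuzzyjoin)

# Made-up example data: the names on the right-hand side are misspelled.
fruit  <- data.frame(name = c("apple", "banana", "cherry"))
prices <- data.frame(name = c("appel", "bananna", "grape"), price = c(1, 2, 3))

# Keep every row of `fruit`; attach a row of `prices` when the two names
# are at most one edit apart (default method: optimal string alignment).
matched <- stringdist_left_join(fruit, prices, by = "name", max_dist = 1)
matched
```

Because both inputs have a column called `name`, the result carries them as `name.x` and `name.y`; "cherry" has no close enough partner, so its right-hand columns are NA.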

Q: What are fuzzy joins?

The “fuzzy join” recipe is dedicated to joins between two datasets whose join keys don't match exactly. It works by calculating a distance chosen by the user and then comparing it to a threshold. DSS handles inner, left, right and outer joins.
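The "distance, then threshold" idea can be sketched in base R without any join package. The vectors and the threshold below are assumptions chosen for illustration; adist() computes a generalized Levenshtein distance, which is only one of many possible distances:

```r
# Made-up keys; the right-hand ones are misspelled.
left  <- c("apple", "banana", "cherry")
right <- c("appel", "bananna", "grape")

# Step 1: distance between every left key and every right key.
d <- adist(left, right)

# Step 2: keep the pairs whose distance is within the threshold.
threshold <- 2
hits <- which(d <= threshold, arr.ind = TRUE)

pairs <- data.frame(
  left  = left[hits[, "row"]],
  right = right[hits[, "col"]],
  dist  = d[hits]
)
pairs
```

This compares every key with every other key, so it only illustrates the principle; it is not meant for large inputs.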

Tags: merge, r, fuzzy-search, fuzzyjoin

I would like to join these two data frames:

a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))

with a condition like (x > start) & (x < end), in order to get a result like this:

#  x    y
#1 1    a
#2 3 <NA>
#3 5    b

I don't want to build a potentially large Cartesian product and then keep only the few rows that match the condition, and I'd like a solution using the tidyverse (I am not interested in an SQL solution, which would be an admission of failure). I thought of the 'fuzzyjoin' package, but I cannot find examples that fit my need: the function that tests the condition takes only two arguments. I also tried to put 'start' and 'end' into a single argument with

data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
#     z y
#1 0, 2 a
#2 4, 6 b

but although the data looks fine, fuzzy_left_join doesn't accept it.

I am looking for solutions that work in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric, with arbitrary conditions).

UPDATE

I also want to be able to express conditions like (x == start + 1) | (x == end + 1), which here would give:

#   x  y
#1  1  a
#2  3  a
#3  5  b

asked May 29 '18 11:05

Nicolas2

1 Answer

For this case you don't need multi_by or multi_match_fun; this works:

library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
#   x start end    y
# 1 1     0   2    a
# 2 3    NA  NA <NA>
# 3 5     4   6    b
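The OR condition from the UPDATE cannot be written as a list of per-column tests, because those are combined with AND. A possible sketch uses multi_by and multi_match_fun instead. It rests on an assumption about fuzzyjoin's behaviour: that the function receives the selected columns of each side, one row per candidate pair, with the right-hand side arriving as a matrix that keeps its column names. The helper name one_past_bound is hypothetical and not part of fuzzyjoin.

```r
library(fuzzyjoin)

a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))

# Hypothetical helper: TRUE where x equals start + 1 or end + 1.
# Assumption: `lhs` holds the x values and `rhs` the start/end columns,
# aligned row by row for each candidate pair.
one_past_bound <- function(lhs, rhs) {
  x   <- as.matrix(lhs)[, 1]  # accept either a vector or a 1-column matrix
  rhs <- as.matrix(rhs)
  (x == rhs[, "start"] + 1) | (x == rhs[, "end"] + 1)
}

# fuzzy_join() is called directly with mode = "left", so match_fun can be
# left at its NULL default.
res <- fuzzy_join(
  a, b,
  multi_by = list(x = "x", y = c("start", "end")),
  multi_match_fun = one_past_bound,
  mode = "left"
)
res[, c("x", "y")]
```

If the assumption holds, this pairs x = 1 and x = 3 with "a" and x = 5 with "b", as requested in the UPDATE. fuzzyjoin still tests every combination of distinct key rows internally, so this keeps the code tidy but does not avoid the pairwise comparison cost.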

answered Sep 28 '22 16:09

Moody_Mudskipper
