Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to make a fuzzy join in R using more than one variable on each side

I would like to join the two data frames :

a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))

with a condition like (x>start)&(x<end) in order to get such a result:

#  x    y
#1 1    a
#2 2 <NA>
#3 3    b

I don't want to make a potentially large cartesian product and then select only the few rows matching the condition and I'd like a solution using the tidyverse (I am not interested in a solution using SQL which would be a confession of failure). I thought of the 'fuzzyjoin' package but I cannot find examples fitting my need : the function to apply for the condition has only two arguments. I also tried to put 'start' and 'end' into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y) # z y #1 0, 2 a #2 4, 6 b

but although the data looks fine fuzzy_left_join doesn't accept it.

I search for solutions working in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric with arbitrary conditions).

UPDATE

I also want to be able to express conditions like (x=start+1)|(x=end+1) giving here:

#   x  y
#1  1  a
#2  3  a
#3  5  b
like image 649
Nicolas2 Avatar asked May 29 '18 11:05

Nicolas2


People also ask

How do you make a fuzzy join in R?

Often you may want to join together two datasets in R based on imperfectly matching strings. This is sometimes called fuzzy matching. The easiest way to perform fuzzy matching in R is to use the stringdist_join() function from the fuzzyjoin package.

What are fuzzy joins?

The “fuzzy join” recipe is dedicated to joins between two datasets when join keys don't match exactly. It works by calculating a distance chosen by user and then comparing it to a threshold. DSS handles inner, left, right or outer joins.


1 Answers

For this case you don't need multi_by or multy_match_fun, this works :

library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
#   x start end    y
# 1 1     0   2    a
# 2 3    NA  NA <NA>
# 3 5     4   6    b
like image 112
Moody_Mudskipper Avatar answered Sep 28 '22 16:09

Moody_Mudskipper