
dplyr: inner_join with a partial string match

I'd like to join two data frames if the seed column in data frame y is a partial match on the string column in x. This example should illustrate:

    # What I have
    x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
    y <- data_frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))

    x
      idX         string
    1   1     Motorcycle
    2   2 TractorTrailer
    3   3       Sailboat

    y
    Source: local data frame [3 x 2]

        idY   seed
      (chr)  (chr)
    1     a ractor
    2     b otorcy
    3     c irplan

    # What I want
    want <- data.frame(idX=c(1,2), idY=c("b", "a"), string=c("Motorcycle", "TractorTrailer"), seed=c("otorcy", "ractor"))

    want
      idX idY         string   seed
    1   1   b     Motorcycle otorcy
    2   2   a TractorTrailer ractor

That is, something like

    inner_join(x, y, by=stringr::str_detect(x$string, y$seed))
Stephen Turner asked Oct 02 '15 19:10


1 Answer

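The fuzzyjoin package provides two functions, regex_inner_join and fuzzy_inner_join, that let you join on partial string matches. regex_inner_join treats the seed column as a regular expression, while fuzzy_inner_join accepts any matching function, such as stringr's str_detect: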

    x <- data.frame(idX=1:3, string=c("Motorcycle", "TractorTrailer", "Sailboat"))
    y <- data.frame(idY=letters[1:3], seed=c("ractor", "otorcy", "irplan"))
    # Convert factors to character (needed in R < 4.0, where stringsAsFactors defaults to TRUE)
    x$string = as.character(x$string)
    y$seed = as.character(y$seed)

    library(dplyr)  # for %>%
    library(fuzzyjoin)
    x %>% regex_inner_join(y, by = c(string = "seed"))

      idX         string idY   seed
    1   1     Motorcycle   b otorcy
    2   2 TractorTrailer   a ractor

    library(stringr)
    x %>% fuzzy_inner_join(y, by = c("string" = "seed"), match_fun = str_detect)

      idX         string idY   seed
    1   1     Motorcycle   b otorcy
    2   2 TractorTrailer   a ractor
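
If installing fuzzyjoin is not an option, the same result can be sketched in base R. This is only an illustration of the idea (pair every row of x with every row of y, then keep the pairs where the seed occurs inside the string), not a description of how fuzzyjoin works internally. The names pairs, is_match and result are made up for this example, and the match is assumed to be a literal substring match rather than a regular expression.

```r
x <- data.frame(idX = 1:3,
                string = c("Motorcycle", "TractorTrailer", "Sailboat"),
                stringsAsFactors = FALSE)
y <- data.frame(idY = letters[1:3],
                seed = c("ractor", "otorcy", "irplan"),
                stringsAsFactors = FALSE)

# Cross join: by = NULL makes merge() return the Cartesian product (3 x 3 = 9 rows).
pairs <- merge(x, y, by = NULL)

# For each pair, test whether seed is a literal substring of string.
# fixed = TRUE is an assumption here: it disables regex interpretation of seed.
is_match <- mapply(function(s, p) grepl(p, s, fixed = TRUE),
                   pairs$string, pairs$seed,
                   USE.NAMES = FALSE)

# Keep the matching pairs and tidy up the row order and row names.
result <- pairs[is_match, ]
result <- result[order(result$idX), ]
rownames(result) <- NULL

print(result)
```

For the sample data this keeps the same two rows as the fuzzyjoin calls above (Motorcycle with otorcy, TractorTrailer with ractor). The cross join grows as nrow(x) * nrow(y), so this approach is probably only reasonable for small tables.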
Feng Mai answered Sep 19 '22 19:09
