I have a data frame that looks like this (Dataframe X):
id number found
1 5225 NA
2 2222 NA
3 3121 NA
I have another data frame that looks like this (Dataframe Y):
id number1 number2
1 4000 6000
3 2500 3300
3 7000 8000
What I want to do is this: For each value in the Dataframe X "number" column, search if it is equal to or between ANY of the "number1" and "number2" pair values of Dataframe Y. Additionally, for this "number1" and "number2" pair values, its respective "id" must match the "id" in Dataframe X. If this is all true, then I want to insert a "YES in the "found" column of the respective row in Dataframe X:
id number found
1 5225 YES
2 2222 NA
3 3121 YES
How would I go about doing this? Thanks for the help.
Here is an option using fuzzy_join
library(fuzzy_join)
library(dplyr)
fuzzy_left_join(X, Y[-1], by = c("number" = "number1", "number" = "number2"),
match_fun =list(`>=`, `<=`)) %>%
mutate(found = c(NA, "YES")[(!is.na(number1)) + 1]) %>%
select(names(X))
# id number found
#1 1 5225 YES
#2 2 2222 <NA>
#3 3 3121 YES
Or another option is a non-equi join with data.table
library(data.table)
setDT(X)[, found := NULL]
X[Y, found := "YES", on = .(number >= number1, number <= number2)]
X
# id number found
#1: 1 5225 YES
#2: 2 2222 <NA>
#3: 3 3121 YES
X <- structure(list(id = 1:3, number = c(5225L, 2222L, 3121L), found = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
Y <- structure(list(id = 1:3, number1 = c(4000L, 2500L, 7000L), number2 = c(6000L,
3300L, 8000L)), class = "data.frame", row.names = c(NA, -3L))
We can loop over each x$number
using sapply
and check if it lies in range of any
of y$number1
and y$number2
and give the value accordingly.
x$found <- ifelse(sapply(x$number, function(p)
any(y$number1 <= p & y$number2 >= p)),"YES", NA)
x
# id number found
#1 1 5225 YES
#2 2 2222 <NA>
#3 3 3121 YES
Using the same logic but with replace
x$found <- replace(x$found,
sapply(x$number, function(p) any(y$number1 <= p & y$number2 >= p)), "YES")
EDIT
If we want to also compare the id
value we could do
x$found <- ifelse(sapply(seq_along(x$number), function(i) {
inds <- y$number1 <= x$number[i] & y$number2 >= x$number[i]
any(inds) & (x$id[i] == y$id[which.max(inds)])
}), "YES", NA)
x$found
#[1] "YES" NA "YES"
Using tidyverse
functions, especially map_chr
to iterate over each number:
library(tidyverse)
tbl1 <- read_table2(
"id number found
1 5225 NA
2 2222 NA
3 3121 NA"
)
tbl2 <- read_table2(
"id number1 number2
1 4000 6000
2 2500 3300
3 7000 8000"
)
tbl1 %>%
mutate(found = map_chr(
.x = number,
.f = ~ if_else(
condition = any(.x > tbl2$number1 & .x < tbl2$number2),
true = "YES",
false = NA_character_
)
))
#> # A tibble: 3 x 3
#> id number found
#> <int> <int> <chr>
#> 1 1 5225 YES
#> 2 2 2222 <NA>
#> 3 3 3121 YES
Created on 2018-10-18 by the reprex package (v0.2.0).
Using sqldf
:
library(sqldf)
sql <- "SELECT DISTINCT x.id, x.number, "
sql <- paste0(sql, "CASE WHEN y.id IS NOT NULL THEN 'YES' END AS found ")
sql <- paste0(sql, "FROM X x LEFT JOIN Y y ON x.number BETWEEN y.number1 AND y.number2")
X <- sqldf(sql)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With