I am looking for an efficient way to join 2 data.frames/data.tables on a character column using a grep/like/stri_detect condition.
I am able to use the sqldf package with a join on like, but it is pretty slow. On my 2 data.tables (5k rows and 20k rows) it takes about 60 seconds.
My second approach was to use CJ from data.table and then stri_detect_fixed on the 2 columns. This approach is faster (16 seconds), but I am afraid that with growing data it will become impossible to use (it significantly increases RAM usage).
I also tried to do it in a for loop, but that was the slowest of all (a rough sketch of it is included after the example below).
Is there any way to do it faster, especially in data.table?
Below I paste my example:
library(stringi)
library(data.table)
library(sqldf)
data1 <- data.table(col1 = paste0(c("asdasd asdasd 768jjhknmnmnj",
                                    "78967ggh", "kl00896754", "kl008jku"), 1:10000))
data2 <- data.table(col2 = paste0(c("mnj", "12345", "kl008", "lll1"), 1:10000))

system.time(join1 <- data.table(sqldf("select *
                                       from data1 a inner join data2 b
                                       on a.col1 like '%' || b.col2 || '%'", drv = "SQLite")))

system.time(kartezjan <- CJ(col1 = data1[, c("col1"), with = FALSE][[1]],
                            col2 = data2[, c("col2"), with = FALSE][[1]],
                            unique = TRUE)[stri_detect_fixed(col1, col2, case_insensitive = FALSE)])
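The for-loop attempt was roughly along the following lines (a simplified sketch of that approach rather than the exact code I ran; loop_join is just an illustrative name):

## Rough sketch of the for-loop approach: for each pattern in data2,
## find the rows of data1 that contain it and collect the matching pairs.
loop_join <- function(data1, data2) {
  res <- vector("list", nrow(data2))
  for (i in seq_len(nrow(data2))) {
    hits <- data1$col1[stri_detect_fixed(data1$col1, data2$col2[i])]
    if (length(hits) > 0) res[[i]] <- data.table(col1 = hits, col2 = data2$col2[i])
  }
  rbindlist(res)
}
system.time(join_loop <- loop_join(data1, data2))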
The sqldf approach is the fastest on my machine for your example data, but here is a faster data.table version in case it helps.
library(data.table)
library(sqldf)
## Example data
v1 <- paste0(c("asdasd asdasd 768jjhknmnmnj", "78967ggh", "kl00896754", "kl008jku"),
             1:10000)
v2 <- paste0(c("mnj", "12345", "kl008", "lll1"), 1:10000)
data1 <- data.table(col1 = v1, key = "col1")
data2 <- data.table(col2 = v2, key = "col2")
## sqldf version
## instr() returns the position of b.col2 within a.col1 (0 when there is no
## match), so this condition is equivalent to: a.col1 like '%' || b.col2 || '%'
system.time(
  ans1 <- data.table(sqldf(
    "select *
     from data1 a inner join data2 b
     on instr(a.col1, b.col2)", drv = "SQLite"))
)
## user system elapsed
## 17.579 0.036 17.654
## parallelized data.table version
suppressMessages(library(foreach))
suppressMessages(library(doParallel))

cores <- detectCores()  ## I've got 4...
clust <- makeForkCluster(cores)
registerDoParallel(clust)

system.time({
  batches <- cores
  ## split data2 into one batch of patterns per worker
  data2[, group := sort(rep_len(1:batches, nrow(data2)))]
  ans2 <- foreach(
    i = 1:batches, .combine = function(...) rbindlist(list(...)),
    .multicombine = TRUE, .inorder = FALSE) %dopar% {
      ## cross join this batch against data1 and keep only the rows
      ## where col1 contains col2
      CJ(col1 = data1[, col1], col2 = data2[group == i, col2])[,
        alike := col1 %like% col2, by = col2][
        alike == TRUE][, alike := NULL][]
    }
})
## user system elapsed
## 0.185 0.229 30.295
stopCluster(clust)
stopImplicitCluster()
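As a quick optional sanity check, the two results should contain the same (col1, col2) pairs, ignoring row order; fsetequal() from data.table should return TRUE if they agree:

## Compare the sqldf and parallel data.table results, ignoring row order
fsetequal(ans1, ans2)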
I'm running this on OSX; you may need to tweak the parallelization code for other operating systems. Also, if your actual data is bigger and you're running out of memory, you can try larger values of batches so that each chunk of the cross join stays smaller.
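Regarding other operating systems: on Windows fork-based clusters are not available, so one untested sketch of an adaptation is to use a PSOCK cluster and make sure each worker has the package and data it needs:

## Windows sketch: no forking available, so use a PSOCK cluster instead
clust <- makePSOCKcluster(cores)
registerDoParallel(clust)
## ...then run the same foreach() block as above, adding
##   .packages = "data.table" and .export = c("data1", "data2")
## so the workers have data.table loaded and can see the data.
stopCluster(clust)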