R join on like/grep condition

I am looking for an efficient way to join two data.frames/data.tables on a character column using a grep/like/stri_detect condition.

I can use the sqldf package with a join on LIKE, but it is pretty slow: on my two data.tables (5k rows and 20k rows) it takes about 60 seconds.

My second approach was to use CJ from data.table and then apply stri_detect_fixed to the two columns. This is faster (16 seconds), but I am afraid it will become unusable as the data grows, because it significantly increases RAM usage.

I also tried a for loop, but that was the slowest option.

Is there a faster way to do this, especially in data.table?

My example is below:

library(stringi)
library(data.table)
library(sqldf)
data1 <- data.table(col1 = paste0(c("asdasd asdasd 768jjhknmnmnj",
"78967ggh","kl00896754","kl008jku"),1:10000))

data2 <- data.table(col2 = paste0(c("mnj", "12345","kl008","lll1"), 1:10000))

system.time(join1 <- data.table(sqldf("select * 
           from data1 a inner join data2 b
                      on a.col1 like '%' || b.col2 || '%'", drv = "SQLite" )))



system.time(kartezjan <- CJ(col1 = data1[["col1"]],
                            col2 = data2[["col2"]],
                            unique = TRUE)[stri_detect_fixed(col1, col2, case_insensitive = FALSE)])
Kacper asked Mar 15 '16 12:03



1 Answer

The sqldf approach (using instr instead of LIKE) is the fastest on my machine for your example data, but here is a parallelized data.table version in case it helps.

library(data.table)
library(sqldf)

## Example data
v1 <- paste0(c("asdasd asdasd 768jjhknmnmnj", "78967ggh","kl00896754","kl008jku"),
    1:10000)
v2 <- paste0(c("mnj", "12345","kl008","lll1"), 1:10000)

data1 <- data.table(col1=v1, key="col1")
data2 <- data.table(col2=v2, key="col2")


## sqldf version
system.time(
  ans1 <- data.table(sqldf(
    "select * 
    from data1 a inner join data2 b
    on instr(a.col1, b.col2)", drv="SQLite"))
  )

##    user  system elapsed 
##  17.579   0.036  17.654 


## parallelized data.table version
suppressMessages(library(foreach)); suppressMessages(library(doParallel))
cores <- detectCores() ## I've got 4...
clust <- makeForkCluster(cores)
registerDoParallel(clust)

system.time({
  batches <- cores
  data2[, group:=sort(rep_len(1:batches, nrow(data2)))]
  ans2 <- foreach(
    i=1:batches, .combine=function(...) rbindlist(list(...)),
    .multicombine=TRUE, .inorder=FALSE) %dopar% {
      CJ(col1=data1[, col1], col2=data2[group==i, col2])[,
        alike:=col1 %like% col2, by=col2][
          alike==TRUE][, alike:=NULL][]          
    }
})

##    user  system elapsed 
##   0.185   0.229  30.295 

stopCluster(clust)
stopImplicitCluster()

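I'm running this on OS X. makeForkCluster relies on forking, which is not available on Windows, so you may need to tweak the parallelization code for other operating systems (for example, by creating the cluster with makeCluster instead). Also, if your actual data is bigger and you're running out of memory, you can try a larger value for batches.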
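
If memory is the main constraint, another option is to avoid the cross join altogether. The following is a minimal sketch, not benchmarked, that loops over the patterns and keeps only the matching rows, so memory grows with the number of matches rather than with nrow(data1) * nrow(data2). The helper name like_join and the small example vectors are made up for illustration. It uses grepl(..., fixed = TRUE), which does a literal substring match like instr() in the sqldf version, whereas %like% treats col2 as a regular expression.

```r
library(data.table)

# Hypothetical helper (not part of the code above): match one pattern at a
# time instead of building the full cross join with CJ().
like_join <- function(x, patterns) {
  x <- unique(x)
  patterns <- unique(patterns)
  hits <- lapply(patterns, function(p) {
    # fixed = TRUE: p is matched as a literal substring, not as a regex
    matched <- x[grepl(p, x, fixed = TRUE)]
    # rep() keeps both columns the same length, including the zero-match case
    data.table(col1 = matched, col2 = rep(p, length(matched)))
  })
  rbindlist(hits)
}

# Small illustrative data, not the benchmark data from the question
small1 <- data.table(col1 = c("kl00896754", "kl008jku", "78967ggh"))
small2 <- data.table(col2 = c("kl008", "ggh", "zzz"))

res <- like_join(small1$col1, small2$col2)
print(res)
```

Each pattern is handled independently, so the lapply call is a natural place to parallelize, but this sketch has not been timed against the versions above.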

dnlbrky answered Oct 19 '22 22:10