R join on like/grep condition

Tags:

join

r

sql-like

data.table

I am looking for an efficient way to join two data.frames/data.tables on a character column using a grep/like/stri_detect condition.

I can use the sqldf package with a join on LIKE, but it is pretty slow. On my two data.tables (5k rows and 20k rows) it takes about 60 seconds.

My second approach was to use CJ from data.table and then apply stri_detect_fixed to the two columns. This approach is faster (16 seconds), but I am afraid it will become unusable as the data grows, because it significantly increases RAM usage.

I also tried doing it in a for loop, but that was the slowest option.

Is there a faster way to do this, especially in data.table?

Below is my example:

library(stringi)
library(data.table)
library(sqldf)
data1 <- data.table(col1 = paste0(c("asdasd asdasd 768jjhknmnmnj",
"78967ggh","kl00896754","kl008jku"),1:10000))

data2 <- data.table(col2 = paste0(c("mnj", "12345","kl008","lll1"), 1:10000))

system.time(join1 <- data.table(sqldf(
  "select *
   from data1 a inner join data2 b
   on a.col1 like '%' || b.col2 || '%'", drv = "SQLite")))

system.time(kartezjan <- CJ(col1 = data1[["col1"]],
                            col2 = data2[["col2"]],
                            unique = TRUE)[stri_detect_fixed(col1, col2, case_insensitive = FALSE)])


asked Mar 15 '16 12:03

Kacper

1 Answer

The sqldf approach (using instr instead of LIKE) is the fastest on my machine for your example data, but here is a parallelized data.table version in case it helps.

library(data.table)
library(sqldf)

## Example data
v1 <- paste0(c("asdasd asdasd 768jjhknmnmnj", "78967ggh","kl00896754","kl008jku"),
    1:10000)
v2 <- paste0(c("mnj", "12345","kl008","lll1"), 1:10000)

data1 <- data.table(col1=v1, key="col1")
data2 <- data.table(col2=v2, key="col2")


## sqldf version
system.time(
  ans1 <- data.table(sqldf(
    "select * 
    from data1 a inner join data2 b
    on instr(a.col1, b.col2)", drv="SQLite"))
  )

##    user  system elapsed 
##  17.579   0.036  17.654 


## parallelized data.table version
suppressMessages(library(foreach))
suppressMessages(library(doParallel))
cores <- detectCores() ## I've got 4...
clust <- makeForkCluster(cores) ## fork-based cluster
registerDoParallel(clust)

system.time({
  batches <- cores
  ## split data2 into `batches` contiguous groups, one per foreach task
  data2[, group:=sort(rep_len(1:batches, nrow(data2)))]
  ans2 <- foreach(
    i=1:batches, .combine=function(...) rbindlist(list(...)),
    .multicombine=TRUE, .inorder=FALSE) %dopar% {
      ## cross join data1 with one batch of data2, then keep the rows where
      ## col1 matches col2 (%like% treats col2 as a regular expression)
      CJ(col1=data1[, col1], col2=data2[group==i, col2])[,
        alike:=col1 %like% col2, by=col2][
          alike==TRUE][, alike:=NULL][]
    }
})

##    user  system elapsed 
##   0.185   0.229  30.295 

stopCluster(clust)
stopImplicitCluster()

I'm running this on OS X, so you may need to tweak the parallelization code for other operating systems (makeForkCluster relies on forking, which is not available on Windows). Also, if your actual data is bigger and you're running out of memory, you can try larger batches values.
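If memory is the main worry, the batching idea can be pushed to its limit by skipping the cross join entirely and scanning col1 once per pattern. The following is only a minimal sketch under a few assumptions: the helper name substring_join and the tiny data set are made up for illustration, it uses base grepl with fixed = TRUE (literal matching, like stri_detect_fixed, rather than the regex matching of %like%), and I have not benchmarked it against the versions above.

```r
library(data.table)

## Small illustrative data (hypothetical, not the benchmark data above)
data1 <- data.table(col1 = c("asdasd 768jjhknmnmnj1", "78967ggh2",
                             "kl008967543", "kl008jku4"))
data2 <- data.table(col2 = c("mnj1", "12345", "kl008", "lll1"))

## Hypothetical helper: for each pattern, scan x once with a literal
## (non-regex) match and keep only the hits, so no more than length(x)
## candidate pairs are held in memory at any one time.
substring_join <- function(x, patterns) {
  hits <- lapply(patterns, function(p) {
    matched <- x[grepl(p, x, fixed = TRUE)]
    data.table(col1 = matched, col2 = rep(p, length(matched)))
  })
  rbindlist(hits)
}

ans3 <- substring_join(data1$col1, data2$col2)
print(ans3)

## Literal vs. regex matching differ when a pattern has metacharacters:
literal_hit <- grepl("a.c", "abc", fixed = TRUE)  # looks for the text "a.c"
regex_hit   <- grepl("a.c", "abc")                # "." matches any character
```

This trades speed for a small memory footprint, so it is likely to be slower than the parallel version on large inputs. The lapply call could also be handed to foreach in the same way as above if you want both.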


answered Oct 19 '22 22:10

dnlbrky
