Efficient way to cbind list by groups in data.table

Tags:

r

data.table

I have a data.frame, data:

data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD", 
    "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring", 
    "class"), row.names = c(NA, -2L), class = "data.frame")

which looks like

#> data
#                                      mystring class
#1 AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD   cat
#2              ASDSDFJSKADDKJSJKDFKSADDLKJFLAK   dog

I am searching for the start and end positions of the pattern "ADD" within the first 20 characters of each string in mystring, using class as the grouping variable.

I am doing this with str_locate_all from the stringr package. Here is my attempt:

library(data.table)
library(stringr)

setDT(data)[,
  cbind(list(str_locate_all(substr(as.character(mystring), 1, 20), "ADD")[[1]][, 1]),
        list(str_locate_all(substr(as.character(mystring), 1, 20), "ADD")[[1]][, 2])),
  by = class]

This gives my desired output

#   class V1 V2
#1:   cat  8 10
#2:   cat 16 18
#3:   dog 10 12

Question: Is this a standard approach, or can it be done more efficiently? str_locate_all returns the start and end positions of each match as two columns of a matrix, and I am wrapping each column in its own list so that I can cbind them into the data.table. Also, how can I specify the column names for the cbinded columns here?

Veerendra Gadekar asked Dec 24 '22 15:12


1 Answer

I think you should first reduce the number of operations per group, so I would start by creating the substring for all rows at once.

setDT(data)[, submystring := substr(mystring, 1L, 20L)]

Then, using the stringi package directly (I don't like wrappers), you could do the following, though I can't currently vouch for its efficiency:

library(stringi)
data[, data.table(matrix(unlist(stri_locate_all_fixed(submystring, "ADD")), ncol = 2)), by = class]
#    class V1 V2
# 1:   cat  8 10
# 2:   cat 16 18
# 3:   dog 10 12
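
On the column-name part of the question: the matrices returned by stri_locate_all_fixed already carry the column names "start" and "end", so one way to keep meaningful names is to convert the matrix with as.data.table instead of rebuilding it from unlist. The following is only a sketch of that idea. The final names add_start and add_end are hypothetical, and the do.call(rbind, ...) step is my own addition so that a class with several strings would still work.

```r
library(data.table)
library(stringi)

# Same example data as in the question
data <- data.table(
  mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
               "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"),
  class = c("cat", "dog")
)
data[, submystring := substr(mystring, 1L, 20L)]

# Stack the per-string match matrices within each group;
# as.data.table() keeps the "start" and "end" column names from stringi
res <- data[, as.data.table(do.call(rbind, stri_locate_all_fixed(submystring, "ADD"))),
            by = class]

# Rename afterwards if other names are preferred (hypothetical names)
setnames(res, c("start", "end"), c("add_start", "add_end"))
res
```

Renaming once at the end with setnames avoids repeating the naming work inside every group.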

Alternatively, you could avoid the matrix and data.table calls per group and instead reshape the data after all the locations have been detected:

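# One long column per group: all start positions first, then all end positions
res <- data[, unlist(stri_locate_all_fixed(submystring, "ADD")), by = class]
# Label each value as start (V1) or end (V2) and number the matches 1..n within each half
res[, `:=`(varnames = rep(c("V1", "V2"), each = .N/2), MatchCount = rep(seq_len(.N/2), 2L)), by = class]
dcast(res, class + MatchCount ~ varnames, value.var = "V1")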
#    class MatchCount V1 V2
# 1:   cat          1  8 10
# 2:   cat          2 16 18
# 3:   dog          1 10 12

A third, similar option is to run stri_locate_all_fixed over the whole data set first and only then unlist per group (instead of running both unlist and stri_locate_all_fixed per group):

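# Locate all matches in one call; V1 is a list column of two-column matrices
res <- data[, .(stri_locate_all_fixed(submystring, "ADD"), class = class)]
# Number of matches per row (each match contributes a start and an end value)
res[, N := lengths(V1)/2L]
res2 <- res[, unlist(V1), by = "class,N"]
# Label values as start (V1) or end (V2) and number the matches 1..N within each half
res2[, `:=`(varnames = rep(c("V1", "V2"), each = N[1L]), MatchCount = rep(seq_len(N[1L]), 2L)), by = class]
dcast(res2, class + MatchCount ~ varnames, value.var = "V1")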
#    class MatchCount V1 V2
# 1:   cat          1  8 10
# 2:   cat          2 16 18
# 3:   dog          1 10 12
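
Taking the "locate everything first" idea one step further, the grouping and the dcast can probably be skipped altogether by stacking the match matrices and repeating each class once per match. This is a sketch of that variant rather than something I have benchmarked. The names locs and n are just illustrative, and the example data is rebuilt so the block runs on its own.

```r
library(data.table)
library(stringi)

# Same example data as in the question
data <- data.table(
  mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
               "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"),
  class = c("cat", "dog")
)

# Locate the pattern once over the whole column (first 20 characters only)
locs <- stri_locate_all_fixed(substr(data$mystring, 1L, 20L), "ADD")

# Number of matrix rows returned for each input string
n <- vapply(locs, nrow, integer(1L))

# Stack the matrices (columns "start" and "end") and attach the class,
# repeated once per row of the corresponding matrix
res <- as.data.table(do.call(rbind, locs))
res[, class := rep(data$class, times = n)]
setcolorder(res, "class")
res
```

Note that a string with no match still contributes one row of NA values by default, so such rows may need filtering afterwards.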
David Arenburg answered Jan 15 '23 17:01