I have a data.frame, data:
data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
"ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring",
"class"), row.names = c(NA, -2L), class = "data.frame")
which looks like
#> data
#                                      mystring class
#1 AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD   cat
#2              ASDSDFJSKADDKJSJKDFKSADDLKJFLAK   dog
I am searching for the start and end positions of the pattern "ADD" within the first 20 characters of the strings in mystring, using class as the grouping variable. I am doing this with str_locate_all from the stringr package. Here is my attempt:
library(data.table)
library(stringr)
setDT(data)[,
cbind(list(str_locate_all(substr(as.character(mystring), 1, 20),"ADD")[[1]][,1]),
list(str_locate_all(substr(as.character(mystring), 1, 20),"ADD")[[1]][,2])),
by = class]
This gives my desired output
# class V1 V2
#1: cat 8 10
#2: cat 16 18
#3: dog 10 12
Question:
Is this a standard approach, or can it be done more efficiently? str_locate_all returns the start and end positions of each match in separate columns, so I put them in separate lists and cbind them together within the data.table. Also, how can I specify the column names for the cbinded columns?
I think you should first reduce the number of operations per group, so I would start by creating the substring for all groups at once.
setDT(data)[, submystring := .Internal(substr(mystring, 1L, 20L))]
Then, using the stringi package (I don't like wrappers), you could do the following (though I can't currently vouch for its efficiency):
library(stringi)
data[, data.table(matrix(unlist(stri_locate_all_fixed(submystring, "ADD")), ncol = 2)), by = class]
# class V1 V2
# 1: cat 8 10
# 2: cat 16 18
# 3: dog 10 12
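On the column-name part of the question: as far as I know, each matrix returned by stri_locate_all_fixed already carries the column names start and end, so converting it with as.data.table instead of unlisting it should keep those names. Below is a minimal sketch, assuming one row per class as in the example data. The names dt, out, from and to are only illustrative.

```r
library(data.table)
library(stringi)

dt <- data.table(
  mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
               "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"),
  class = c("cat", "dog")
)

# Restrict the search to the first 20 characters, once for all rows
dt[, submystring := substr(mystring, 1L, 20L)]

# Each group holds a single string here, so [[1L]] picks its match matrix;
# as.data.table() keeps the matrix column names "start" and "end"
out <- dt[, as.data.table(stri_locate_all_fixed(submystring, "ADD")[[1L]]), by = class]

# Rename afterwards if other names are preferred
setnames(out, c("start", "end"), c("from", "to"))
out
```

If a class can span several rows, the [[1L]] shortcut no longer applies and the per-row matrices would have to be combined first (for example with do.call(rbind, ...)).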
Alternatively, you could avoid the matrix and data.table calls per group and instead spread the data after all the locations have been detected:
res <- data[, unlist(stri_locate_all_fixed(submystring, "ADD")), by = class]
# each match index appears twice per group: once for its start (V1), once for its end (V2)
res[, `:=`(varnames = rep(c("V1", "V2"), each = .N/2), MatchCount = rep(1:(.N/2), 2L)), by = class]
dcast(res, class + MatchCount ~ varnames, value.var = "V1")
# class MatchCount V1 V2
# 1: cat 1 8 10
# 2: cat 2 16 18
# 3: dog 1 10 12
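The labelling step relies on how unlist() flattens a matrix: column by column, so all start positions come first, followed by all end positions. Here is a small base R sketch of that ordering, using a hand-built matrix (m is a hypothetical stand-in for one element of the stri_locate_all_fixed result, not actual library output):

```r
# Stand-in for the match matrix of the "cat" string: two matches, start/end columns
m <- matrix(c(8L, 16L, 10L, 18L), ncol = 2,
            dimnames = list(NULL, c("start", "end")))

# Flattening walks down the first column, then the second
flat <- unlist(list(m))

n <- length(flat) / 2                        # number of matches
varnames   <- rep(c("V1", "V2"), each = n)   # first half are starts, second half are ends
matchcount <- rep(seq_len(n), times = 2)     # match index repeated once per column

data.frame(flat, varnames, matchcount)
```

Because the match index is repeated exactly twice (once per column), the lengths line up for any number of matches in a group.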
A third, similar option is to first run stri_locate_all_fixed over the whole data set and only then unlist per group (instead of running both unlist and stri_locate_all_fixed per group):
res <- data[, .(stri_locate_all_fixed(submystring, "ADD"), class = class)]
res[, N := lengths(V1)/2L]
res2 <- res[, unlist(V1), by = "class,N"]
res2[, `:=`(varnames = rep(c("V1", "V2"), each = N[1L]), MatchCount = rep(1:N[1L], 2L)), by = class]
dcast(res2, class + MatchCount ~ varnames, value.var = "V1")
# class MatchCount V1 V2
# 1: cat 1 8 10
# 2: cat 2 16 18
# 3: dog 1 10 12