I am trying to gregexpr search for the locations of "ABCD" in a large string and "ABBD, ACCD, AAAD" in the same string. I want to output the "ABCD" search results and the "ABBD, ACCD, AAAD" search results in two separate columns of a data table.
My current approach is to use gregexpr separately, export each as a 1 column txt file, import each as matrices, sort each 1 column matrix so that the numbers ascend by row, column bind the two matrices, and convert the resulting two column matrix into a data table.
This approach seems to be highly inefficient when dealing with a extremely large string, and takes quite a while to complete. Is there any way to optimize the procedure? Thanks for your help!
# dummy string that is relatively short for this demo
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"
# SEARCH for 'ABCD' location
out1 <- gregexpr(pattern = "ABCD", x)
cat(paste(c(out1[[1]]), sep = "\n", collapse = "\n"), file = "~/out_1.txt")
# SEARCH for 'A??D' location
outB <- gregexpr(pattern = "ABBD", x)
outC <- gregexpr(pattern = "ACCD", x)
outA <- gregexpr(pattern = "AAAD", x)
cat(paste(c(outA[[1]], outB[[1]], outC[[1]]), collapse = "\n"), file = "~/out_2.txt")
# Function that BINDS Matrices by column
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x) rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
# Load as Tables --> Sort by numbers increasing --> Matrices
mat1 <- as.matrix(read.table("~/out_1.txt"))
mat2.t <- (read.table("~/out_2.txt"))
mat2 <- as.matrix(mat2.t[order(mat2.t$V1),])
# Combine two matrices to create 2 column matrix
comb_mat <- cbind.fill(mat1, mat2)
write.table(comb_mat, file = "~/comb_mat.txt", row.names = FALSE, col.names = FALSE)
fixed=T
argument of gregexpr()
, which will likely yield a performance benefit. From https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html:If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
sort()
to sort the second column immediately, rather than storing an intermediate variable and then indexing it with order()
.cbind.fill()
function works, but the task of NA-padding can be accomplished easily with out-of-bounds indexing, for which R naturally returns NAs for the out-of-bounds indexes.Hence:
x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD';
out1 <- c(gregexpr('ABCD',x,fixed=T)[[1]]);
out2 <- sort(c(gregexpr('AAAD',x,fixed=T)[[1]],gregexpr('ABBD',x,fixed=T)[[1]],gregexpr('ACCD',x,fixed=T)[[1]]));
outmax <- max(length(out1),length(out2));
comb_mat <- cbind(out1[1:outmax],out2[1:outmax]);
comb_mat;
## [,1] [,2]
## [1,] 1 5
## [2,] 13 9
## [3,] 29 17
## [4,] NA 21
## [5,] NA 25
You can then write comb_mat
to a file as per your write.table()
call.
Edit: As you (and now I) have discovered, gregexpr()
performs surprisingly poorly on large strings, and your 237MB string is definitely a large string. From Fast partial string matching in R, we can use the stringi
package to speed up performance. What follows is a demo of how to use stringi::stri_locate_all()
to accomplish your requirement. Some notes:
x
, and you can see I load it with data.table::fread()
, as read.table()
was taking way too long.1:outmax
anymore.Hence:
library('data.table');
library('stringi');
x <- fread('x',header=F)$V1;
## Read 1 rows and 1 (of 1) columns from 0.221 GB file in 00:00:03
system.time({ out1 <- stri_locate_all(x,regex='ABCD')[[1]][,'start']; });
## user system elapsed
## 3.687 0.359 4.044
system.time({ out2 <- stri_locate_all(x,regex='AAAD|ABBD|ACCD')[[1]][,'start']; });
## user system elapsed
## 4.938 0.454 5.404
length(out1);
## [1] 22218750
length(out2);
## [1] 37031250
length(out1) <- length(out2) <- max(length(out1),length(out2));
comb_mat <- cbind(out1,out2);
head(comb_mat);
## out1 out2
## [1,] 1 5
## [2,] 13 9
## [3,] 29 17
## [4,] 33 21
## [5,] 45 25
## [6,] 61 37
tail(comb_mat);
## out1 out2
## [37031245,] NA 236999961
## [37031246,] NA 236999973
## [37031247,] NA 236999977
## [37031248,] NA 236999985
## [37031249,] NA 236999989
## [37031250,] NA 236999993
nrow(comb_mat);
## [1] 37031250
You can simplify it with lookaheads, so you only have one regex with two captured components.
ms <- gregexpr("A(?=(BCD)|(BBD|CCD|AAD))", x, perl=T)
res <- attr(ms[[1]], "capture.start")
res[res>0] <- res[res>0]-1
In this matrix, res
, the first column is starting positions of ABCD, the second of the other three. If you want you can replace the zeros with NA.
# [1,] 1 0
# [2,] 0 5
# [3,] 0 9
# [4,] 13 0
# [5,] 0 17
# [6,] 0 21
# [7,] 0 25
# [8,] 29 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With