R: Faster gregexpr for very large strings

I am using gregexpr to search a large string for the locations of "ABCD", and to search the same string for the locations of "ABBD", "ACCD", and "AAAD". I want to output the "ABCD" results and the "ABBD"/"ACCD"/"AAAD" results in two separate columns of a data table.

My current approach is to run gregexpr separately for each set of patterns, export each result as a one-column txt file, re-import each as a matrix, sort each one-column matrix so the numbers ascend by row, column-bind the two matrices, and convert the resulting two-column matrix into a data table.

This approach seems highly inefficient when dealing with an extremely large string, and takes quite a while to complete. Is there any way to optimize the procedure? Thanks for your help!

# dummy string that is relatively short for this demo
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# SEARCH for 'ABCD' location
out1 <- gregexpr(pattern = "ABCD", x)
cat(paste(out1[[1]], collapse = "\n"), file = "~/out_1.txt")

# SEARCH for 'A??D' location
outB <- gregexpr(pattern = "ABBD", x)
outC <- gregexpr(pattern = "ACCD", x)
outA <- gregexpr(pattern = "AAAD", x)
cat(paste(c(outA[[1]], outB[[1]], outC[[1]]), collapse = "\n"), file = "~/out_2.txt")

# Function that BINDS matrices by column, padding shorter ones with NA rows
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))   # height of the tallest matrix
  # pad each matrix with NA rows up to n rows, then bind the columns
  do.call(cbind, lapply(nm, function(m) rbind(m, matrix(NA, n - nrow(m), ncol(m)))))
}

# Load as tables --> sort so numbers ascend --> matrices
mat1 <- as.matrix(read.table("~/out_1.txt"))
mat2.t <- read.table("~/out_2.txt")
mat2 <- as.matrix(mat2.t[order(mat2.t$V1), ])

# Combine two matrices to create 2 column matrix 
comb_mat <- cbind.fill(mat1, mat2)
write.table(comb_mat, file = "~/comb_mat.txt", row.names = FALSE, col.names = FALSE)
asked Jul 04 '15 by alki

2 Answers

  • There's no need for the intermediate files.
  • I would use the fixed=TRUE argument of gregexpr(), which will likely yield a performance benefit (a rough timing sketch follows this list). From https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html:

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).

  • You can use sort() to sort the second column immediately, rather than storing an intermediate variable and then indexing it with order().
  • Your cbind.fill() function works, but the NA-padding can be accomplished more simply with out-of-bounds indexing, since R naturally returns NA for out-of-bounds indexes.
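
To illustrate the engine advice, here's a hedged timing sketch; the inflated test string is my own assumption, and the relative timings will vary by machine (strrep() requires R >= 3.3.0):

## inflate the example string so the engine differences become visible
big <- strrep('ABCDACCDABBDABCDAAADACCDABBDABCD',1e5);
system.time(gregexpr('ABCD',big));             ## default engine
system.time(gregexpr('ABCD',big,perl=T));      ## PCRE
system.time(gregexpr('ABCD',big,fixed=T));     ## fixed matching, typically fastest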

Hence:

x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD';
out1 <- c(gregexpr('ABCD',x,fixed=T)[[1]]);
out2 <- sort(c(gregexpr('AAAD',x,fixed=T)[[1]],gregexpr('ABBD',x,fixed=T)[[1]],gregexpr('ACCD',x,fixed=T)[[1]]));
outmax <- max(length(out1),length(out2));
comb_mat <- cbind(out1[1:outmax],out2[1:outmax]);
comb_mat;
##      [,1] [,2]
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   NA   21
## [5,]   NA   25

You can then write comb_mat to a file as per your write.table() call.
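
For example, reusing the call from the question:

write.table(comb_mat, file='~/comb_mat.txt', row.names=F, col.names=F);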


Edit: As you (and now I) have discovered, gregexpr() performs surprisingly poorly on large strings, and your 237MB string is definitely a large string. From Fast partial string matching in R, we can use the stringi package to speed up performance. What follows is a demo of how to use stringi::stri_locate_all() to accomplish your requirement. Some notes:

  • For my own testing I created my own 237MB file, which is actually exactly 237,000,001 bytes in size. I used vim to repeat your 32-byte example string 7,406,250 times, totalling 237,000,000 bytes; the extra byte comes from the LF appended by vim. I named my test file x, and you can see I load it with data.table::fread(), as read.table() was taking far too long. (A sketch for generating an equivalent file directly in R follows this list.)
  • I made a minor change to my NA-padding algorithm. Instead of using out-of-bounds indexing, I realized we can assign the lengths of the vectors to the max length, leveraging right-to-left associativity for the assignment operator. The advantage here is that we don't have to construct the index vector 1:outmax anymore.
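
If you'd rather not use vim, here is a minimal sketch for generating the same test file in R (assuming the file name x, and strrep() from R >= 3.3.0):

## repeat the 32-byte example string 7,406,250 times (237,000,000 bytes);
## writeLines() appends a trailing LF, giving 237,000,001 bytes in total
writeLines(strrep('ABCDACCDABBDABCDAAADACCDABBDABCD',7406250),'x');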

Hence:

library('data.table');
library('stringi');
x <- fread('x',header=F)$V1;
## Read 1 rows and 1 (of 1) columns from 0.221 GB file in 00:00:03
system.time({ out1 <- stri_locate_all(x,regex='ABCD')[[1]][,'start']; });
##    user  system elapsed
##   3.687   0.359   4.044
system.time({ out2 <- stri_locate_all(x,regex='AAAD|ABBD|ACCD')[[1]][,'start']; });
##    user  system elapsed
##   4.938   0.454   5.404
length(out1);
## [1] 22218750
length(out2);
## [1] 37031250
length(out1) <- length(out2) <- max(length(out1),length(out2));
comb_mat <- cbind(out1,out2);
head(comb_mat);
##      out1 out2
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   33   21
## [5,]   45   25
## [6,]   61   37
tail(comb_mat);
##             out1      out2
## [37031245,]   NA 236999961
## [37031246,]   NA 236999973
## [37031247,]   NA 236999977
## [37031248,]   NA 236999985
## [37031249,]   NA 236999989
## [37031250,]   NA 236999993
nrow(comb_mat);
## [1] 37031250
answered by bgoldst

You can simplify this with lookaheads, so you only need one regex with two captured components.

ms <- gregexpr("A(?=(BCD)|(BBD|CCD|AAD))", x, perl=T)  # zero-width lookahead, one capture group per pattern set
res <- attr(ms[[1]], "capture.start")                  # capture start positions, one column per group (0 = no match)
res[res>0] <- res[res>0]-1                             # shift captures back by 1 to point at the leading 'A'

In this matrix res, the first column holds the starting positions of ABCD and the second those of the other three patterns. If you want, you can replace the zeros with NA (see the one-liner after the output below).

# [1,]  1  0
# [2,]  0  5
# [3,]  0  9
# [4,] 13  0
# [5,]  0 17
# [6,]  0 21
# [7,]  0 25
# [8,] 29  0
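
For that last step, a one-liner (a 0 in capture.start means the group did not participate in the match):

res[res == 0] <- NA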
answered by Rorschach