Imagine I have a set of strings, say:
#1: "A-B-B-C-C"
#2: "A-A-A-A-A-A-A"
#3: "B-B-B-C-A-A"
Now I want to check whether certain patterns occur in the first, middle, or last third in the sequence. Hence, I want to be able to formulate a rule of the kind:
Match the string if, and only if,
marker X occurs in the first/middle/last third of the string
For example, I may want to match strings which have an A in the first third. Then, considering the sequences above, I would match #1 and #2. I could also want to match strings which have an A in the last third. This would match #2 and #3.
How can I write a generic code/regex pattern that can take various rules of this kind as input and then match the appropriate subsequences?
To check if a string contains a substring in Python using the in operator, we simply invoke it on the superstring:
fullstring = "StackAbuse"
substring = "tack"
if substring in fullstring:
    print("Found!")
else:
    print("Not found!")
Using any() to check if a string contains an element from a list: any() is a common and efficient way to perform this task. It tests each element of the list against the string and returns True as soon as one of them is found.
Here's a fully vectorized attempt (you can play around with the settings and tell me if you want to add/change something)
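StriDetect <- function(x, seg = 1L, pat = "A", frac = 3L, fixed = TRUE, values = FALSE){
  # Drop the separators so that positions count markers only
  xsub <- gsub("-", "", x, fixed = TRUE)
  # Segment width; may be fractional, substr() truncates the bounds to integers
  sizes <- nchar(xsub) / frac
  # Extract segment number `seg` of each string
  subs <- substr(xsub, sizes * (seg - 1L) + 1L, sizes * seg)
  # Return the matching strings if values = TRUE, otherwise their indices
  if(isTRUE(values)) x[grep(pat, subs, fixed = fixed)] else grep(pat, subs, fixed = fixed)
}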
Testing on your vector
x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
StriDetect(x, 1L, "A")
## [1] 1 2
StriDetect(x, 3L, "A")
## [1] 2 3
Or if you want the actual matched strings
StriDetect(x, 1L, "A", values = TRUE)
## [1] "A-B-B-C-C" "A-A-A-A-A-A-A"
StriDetect(x, 3L, "A", values = TRUE)
## [1] "A-A-A-A-A-A-A" "B-B-B-C-A-A"
Please note that when the string length doesn't divide exactly by 3 (for example, nchar(x) == 10), the last third is the largest group by default (e.g. size 4 if nchar(x) == 10).
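To make the note above concrete, here is a minimal sketch of where the segment boundaries fall for a 10-character string. It assumes the same arithmetic as StriDetect (a fractional width whose bounds substr() truncates to integers); segment_bounds is a hypothetical helper introduced only for illustration and is not part of the answer.

```r
# Hypothetical helper, not part of the original answer: the integer bounds
# that substr() ends up using for each of `frac` segments of a length-n string.
segment_bounds <- function(n, frac = 3L) {
  size <- n / frac                          # fractional segment width
  seg <- seq_len(frac)
  data.frame(
    segment = seg,
    start = trunc(size * (seg - 1L) + 1L),  # substr() truncates toward zero
    end = trunc(size * seg)
  )
}

x <- "ABCDEFGHIJ"                           # 10 characters, no separators
b <- segment_bounds(nchar(x))
substring(x, b$start, b$end)
## [1] "ABC"  "DEF"  "GHIJ"
```

The first two segments get 3 characters each and the last one gets 4, which is the behaviour described above.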
Here's a solution which generates regexes to meet the desired requirements. Note that regexes can count, but they can't count relative to the total string length. So this generates a custom regex for each input string based on its length. I've used stringi::stri_detect_regex rather than grep since the latter isn't vectorised on the pattern term. I've also assumed that the pattern argument is itself a valid regular expression and that any characters that need escaping (e.g. [, .) are escaped.
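On the vectorisation point: if stringi is not available, one base-R way to pair each string with its own pattern is mapply. This is only a sketch; the patterns are hard-coded here for illustration (one per string, written by hand), and perl = TRUE is an assumption made so that bounds such as {0} are handled by PCRE.

```r
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
# One hand-written pattern per string (illustrative values)
patterns <- c("^.{0}.*A.*.{6}$", "^.{0}.*A.*.{9}$", "^.{0}.*A.*.{7}$")

# grepl() takes a single pattern, so apply it pairwise over both vectors
hits <- mapply(
  function(p, s) grepl(p, s, perl = TRUE),
  patterns, strings,
  USE.NAMES = FALSE
)
hits
## [1]  TRUE  TRUE FALSE
```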
library("stringi")
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
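get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  # Number of characters that must precede the requested fraction
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  # Number of characters that must follow the requested fraction
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  # One regex per string: skip `before` characters, find the pattern, leave `after`
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}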
(patterns <- get_regex_fn_fractions(strings, "A", 1))
#[1] "^.{0}.*A.*.{6}$" "^.{0}.*A.*.{9}$" "^.{0}.*A.*.{7}$"
# Test the regexes:
stri_detect_regex(strings, patterns)
#[1] TRUE TRUE FALSE
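Since the question asks for something that takes several rules as input, here is one possible sketch in base R. It follows the substring approach (dashes removed, segments cut with substr) rather than generated regexes. The names in_segment and match_rules, and the rule format (a data frame with marker and segment columns), are assumptions made for illustration, and markers are matched as fixed strings.

```r
# Hypothetical helper: TRUE where `marker` occurs in the given third of each
# string. Dashes are dropped first, so positions count markers only.
in_segment <- function(x, marker, segment = c("first", "middle", "last")) {
  segment <- match.arg(segment)
  seg <- match(segment, c("first", "middle", "last"))
  tokens <- gsub("-", "", x, fixed = TRUE)
  n <- nchar(tokens)
  # substr() truncates fractional bounds, so the last third absorbs the remainder
  part <- substr(tokens, n * (seg - 1) / 3 + 1, n * seg / 3)
  grepl(marker, part, fixed = TRUE)
}

# Each row of `rules` is one (marker, segment) rule; a string matches only if
# every rule holds for it.
match_rules <- function(x, rules) {
  hits <- vapply(
    seq_len(nrow(rules)),
    function(i) in_segment(x, rules$marker[i], rules$segment[i]),
    logical(length(x))
  )
  hits <- matrix(hits, nrow = length(x))  # keep a matrix even for one string
  rowSums(hits) == nrow(rules)
}

x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
rules <- data.frame(marker = c("A", "C"), segment = c("first", "last"),
                    stringsAsFactors = FALSE)
x[match_rules(x, rules)]
## [1] "A-B-B-C-C"
```

Combining rules with "all must hold" is a design choice of this sketch; replacing the final comparison with rowSums(hits) > 0 would give "any rule holds" instead.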