
Identifying strings based on where substrings appear in the string

Tags: string, regex, r

Imagine I have a set of strings, say:

#1: "A-B-B-C-C"
#2: "A-A-A-A-A-A-A"
#3: "B-B-B-C-A-A"

Now I want to check whether certain patterns occur in the first, middle, or last third in the sequence. Hence, I want to be able to formulate a rule of the kind:

Match the string if, and only if, 
marker X occurs in the first/middle/last third of the string

For example, I may want to match strings which have an A in the first third. Considering the sequences above, I would then match #1 and #2. I might also want to match strings which have an A in the last third. This would match #2 and #3.

How can I write a generic code/regex pattern that can take various rules of this kind as input and then match the appropriate subsequences?

histelheim asked Jul 16 '15 09:07


2 Answers

Here's a fully vectorized attempt (you can play around with the settings and tell me if you want to add/change something)

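StriDetect <- function(x, seg = 1L, pat = "A", frac = 3L, fixed = TRUE, values = FALSE){
  # Drop the separators so that positions count markers only
  xsub <- gsub("-", "", x, fixed = TRUE)
  # Segment length per string (may be fractional; substr() truncates the indices)
  sizes <- nchar(xsub) / frac
  # Extract segment number `seg` (out of `frac`) from every string
  subs <- substr(xsub, sizes * (seg - 1L) + 1L, sizes * seg)
  # Return the matching strings if values = TRUE, otherwise their indices
  if(isTRUE(values)) x[grep(pat, subs, fixed = fixed)] else grep(pat, subs, fixed = fixed)
}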

Testing on your vector

x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
StriDetect(x, 1L, "A")
## [1] 1 2
StriDetect(x, 3L, "A")
## [1] 2 3

Or if you want the actual matched strings

StriDetect(x, 1L, "A", values = TRUE)
## [1] "A-B-B-C-C"     "A-A-A-A-A-A-A"
StriDetect(x, 3L, "A", values = TRUE)
## [1] "A-A-A-A-A-A-A" "B-B-B-C-A-A"  

Please note that when the string length (after removing the dashes) doesn't divide exactly by 3, for example 10 markers, the last third is the largest group by default (size 4 in that example), because substr() truncates the fractional start and stop positions.
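To make the uneven split concrete, here is a minimal sketch (not part of the answer itself) that lists the pieces produced by the same index arithmetic. The helper name segment_pieces is hypothetical, and the input is assumed to already have its separators removed.

```r
# Hypothetical helper: returns the characters that fall into each of
# `frac` segments, using the same index arithmetic as StriDetect.
segment_pieces <- function(s, frac = 3L) {
  size <- nchar(s) / frac          # fractional segment length
  seg <- seq_len(frac)             # segment numbers 1..frac
  # substring() coerces the fractional positions to integers (truncation)
  substring(s, size * (seg - 1L) + 1L, size * seg)
}

segment_pieces("ABCDEFGHIJ")
## [1] "ABC"  "DEF"  "GHIJ"
```

With 10 markers the first two segments get 3 characters each and the last one gets 4.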

David Arenburg answered Sep 24 '22 15:09


Here's a solution which generates regexes to meet the desired requirements. Note that regexes can count, but they can't count relative to the total length of the string, so this generates a custom regex for each input string based on its length. I've used stringi::stri_detect_regex rather than grep, since the latter isn't vectorised over the pattern argument. I've also assumed that the pattern argument is itself a valid regular expression and that any characters that need escaping (e.g. [, .) are already escaped.

library("stringi")
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
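get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  # Number of characters that must precede the requested fraction
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  # Number of characters that must follow it
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  # One anchored regex per string: skip `before` chars, find the pattern, leave `after` chars
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}
# Build one regex per string for an A in the first third
(patterns <- get_regex_fn_fractions(strings, "A", 1))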
#[1] "^.{0}.*A.*.{6}$" "^.{0}.*A.*.{9}$" "^.{0}.*A.*.{7}$"

# Test the regexes:
stri_detect_regex(strings, patterns)
#[1]  TRUE  TRUE FALSE
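
If stringi isn't available, the same per-string regexes can probably be applied in base R by pairing each pattern with its own string. The following is only a sketch under that assumption: detect_fraction_base is a hypothetical name, it repeats the regex construction so that it runs on its own, and perl = TRUE is chosen here so the generated quantifiers are handled by PCRE.

```r
# Hypothetical base-R helper; mirrors the regex construction shown above.
detect_fraction_base <- function(strings, pattern, which_fraction, n_groups = 3) {
  n <- nchar(strings)
  before <- round(n / n_groups * (which_fraction - 1))
  after  <- round(n / n_groups * (n_groups - which_fraction))
  regexes <- sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
  # grepl() accepts a single pattern, so pair each regex with its own string
  mapply(grepl, regexes, strings,
         MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)
}

strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
detect_fraction_base(strings, "A", 1)
## [1]  TRUE  TRUE FALSE
detect_fraction_base(strings, "A", 3)
## [1] FALSE  TRUE  TRUE
```

The second call checks the last third and flags #2 and #3, which is the result the question expects for that rule.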
Nick Kennedy answered Sep 20 '22 15:09