Identifying strings based on where substrings appear in the string

Tags: string, regex, r

Imagine I have a set of strings, say:

#1: "A-B-B-C-C"
#2: "A-A-A-A-A-A-A"
#3: "B-B-B-C-A-A"

Now I want to check whether certain patterns occur in the first, middle, or last third in the sequence. Hence, I want to be able to formulate a rule of the kind:

Match the string if, and only if, 
marker X occurs in the first/middle/last third of the string

For example, I may want to match strings which have an A in the first third. Then, considering the sequences above, I would match #1 and #2. I might also want to match strings which have an A in the last third. This would match #2 and #3.

How can I write a generic code/regex pattern that can take various rules of this kind as input and then match the appropriate subsequences?

asked Jul 16 '15 09:07

histelheim

2 Answers

Here's a fully vectorized attempt (you can play around with the settings and tell me if you want to add/change something)

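StriDetect <- function(x, seg = 1L, pat = "A", frac = 3L, fixed = TRUE, values = FALSE){
  xsub <- gsub("-", "", x, fixed = TRUE)   # drop the separators
  sizes <- nchar(xsub) / frac              # segment length (may be fractional)
  # substr() truncates fractional positions, so this extracts segment `seg` of `frac`
  subs <- substr(xsub, sizes * (seg - 1L) + 1L, sizes * seg)
  # return the matching indices, or the original strings when values = TRUE
  if(isTRUE(values)) x[grep(pat, subs, fixed = fixed)] else grep(pat, subs, fixed = fixed)
}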

Testing on your vector

x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
StriDetect(x, 1L, "A")
## [1] 1 2
StriDetect(x, 3L, "A")
## [1] 2 3

Or if you want the actual matched strings

StriDetect(x, 1L, "A", values = TRUE)
## [1] "A-B-B-C-C"     "A-A-A-A-A-A-A"
StriDetect(x, 3L, "A", values = TRUE)
## [1] "A-A-A-A-A-A-A" "B-B-B-C-A-A"

Please note that when the string length (counted after the hyphens are removed) doesn't divide exactly by 3, the last third is the largest group by default (e.g. size 4 if the length is 10).
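To see how the uneven split plays out, here is a minimal sketch that lists the segments a string is cut into. `show_segments` is a hypothetical helper, not part of the answer; it reuses the same position arithmetic as `StriDetect` and assumes the hyphens have already been removed.

```r
# Hypothetical helper: return the `frac` segments that the substr() arithmetic
# in StriDetect would inspect for a single, already de-hyphenated string.
show_segments <- function(s, frac = 3L) {
  size <- nchar(s) / frac            # may be fractional, e.g. 10 / 3
  seg <- seq_len(frac)
  # substring() coerces start/stop to integer, truncating the fractions
  substring(s, size * (seg - 1L) + 1L, size * seg)
}

show_segments("ABCDEFGHIJ")    # 10 characters
## [1] "ABC"  "DEF"  "GHIJ"
show_segments("ABCDEFGHIJK")   # 11 characters
## [1] "ABC"  "DEFG" "HIJK"
```

With 11 characters the middle and last segments tie at size 4, so the extra characters always end up towards the end of the string rather than the start.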

answered Sep 24 '22 15:09

David Arenburg

Here's a solution which generates regexes to meet the desired requirements. Note that regexes can count, but they can't count relative to the total length of the string, so this generates a custom regex for each input string based on its length. I've used stringi::stri_detect_regex rather than grep since the latter isn't vectorised over the pattern argument. I've also assumed that the pattern argument is itself a valid regular expression and that any characters that need escaping (e.g. [, .) are escaped.

library("stringi")
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}
(patterns <- get_regex_fn_fractions(strings, "A", 1))
#[1] "^.{0}.*A.*.{6}$" "^.{0}.*A.*.{9}$" "^.{0}.*A.*.{7}$"

#Test regexs:
stri_detect_regex(strings, patterns)
#[1]  TRUE  TRUE FALSE
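
The same generator can be pointed at the last third. The sketch below repeats the function so that it runs on its own, and, as an assumption for readers without stringi installed, pairs each pattern with its string using base R's mapply and grepl instead of stri_detect_regex. Note that this approach counts the hyphens as characters.

```r
# Same generator as above, repeated so the sketch is self-contained
get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}

strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
(last_third <- get_regex_fn_fractions(strings, "A", 3))
#[1] "^.{6}.*A.*.{0}$" "^.{9}.*A.*.{0}$" "^.{7}.*A.*.{0}$"

# Base R alternative to stri_detect_regex: test pattern i against string i only
matched <- unname(mapply(function(p, s) grepl(p, s, perl = TRUE), last_third, strings))
matched
#[1] FALSE  TRUE  TRUE
```

This agrees with the expectation in the question that an A in the last third matches #2 and #3.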

answered Sep 20 '22 15:09

Nick Kennedy
