Subset string by counting specific characters

Question

I have the following strings:

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

I want to cut off each string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:

some_function(strings)

c("ABBSDGN", "AABSDG", "AGN", "GGG")

I tried stringi, stringr and regular expressions, but I can't figure it out.

Cameron Bieganek · Accepted Answer

You can accomplish your task with a simple call to str_extract from the stringr package:

library(stringr)

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, or N that you are looking for by changing the integer in the curly braces. Note that str_extract returns NA for a string that contains fewer occurrences than requested:

str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
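
If you want the character set and the count as arguments, the pattern can be built with sprintf. This is only a minimal sketch: cut_at_nth is a hypothetical helper name, not something from stringr or base R, and it assumes chars contains no characters that are special inside a regex bracket expression.

```r
# Hypothetical helper (not part of any package): keep each string up to the
# n-th occurrence of any character in `chars`; NA if there are fewer than n.
cut_at_nth <- function(x, chars = "AGN", n = 3) {
  # Builds e.g. "([^AGN]*[AGN]){3}". Assumes `chars` has no characters that
  # are special inside a bracket expression, such as "]", "^", "-" or "\\".
  pattern <- sprintf("([^%s]*[%s]){%d}", chars, chars, n)
  m <- regexpr(pattern, x)
  hit <- !is.na(m) & m != -1          # elements where the pattern matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches() returns only the matches
  out
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_at_nth(strings, n = 3)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_at_nth(strings, n = 4)
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```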

There are a couple of ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:

m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Alternatively, you can use sub:

sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
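
One thing to be aware of: the two base R approaches treat strings without enough occurrences differently from str_extract. As a quick illustration (the variable names res_regmatches and res_sub are just for this example), here is what happens with a count of 4, where the second string has no match:

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
pattern <- "([^AGN]*[AGN]){4}"

# regmatches() drops elements without a match, so the result is shorter
# than the input and no longer lines up with `strings`.
m <- regexpr(pattern, strings)
res_regmatches <- regmatches(strings, m)
res_regmatches
# [1] "ABBSDGNHN"  "AGNA"       "GGGDSRTYHG"

# sub() leaves elements without a match unchanged instead of returning NA.
res_sub <- sub(paste0("(", pattern, ").*"), "\\1", strings)
res_sub
# [1] "ABBSDGNHN"  "AABSDGDRY"  "AGNA"       "GGGDSRTYHG"
```

So if you need a result that is aligned with the input and marks non-matches explicitly, str_extract is the more convenient of the three.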

Maurits Evers · Answer

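Here is a base R option using strsplit. It splits each string into single characters and keeps everything up to the first position where the running count of A, G and N equals 3: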

sapply(strsplit(strings, ""), function(x)
    paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Or in the tidyverse:

library(tidyverse)
map_chr(str_split(strings, ""), 
    ~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
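
A caveat with both versions: which.max returns 1 when the comparison is FALSE everywhere, so a string that never reaches the target count is cut down to its first character rather than flagged. Below is a rough sketch of a variant that returns NA in that case. cut_by_count is a hypothetical name, and returning NA is my assumption about the desired behaviour, chosen to mirror str_extract.

```r
# Hypothetical helper: same cumulative-count idea as above, but returns NA
# when a string contains fewer than n of the target characters.
cut_by_count <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    hits <- cumsum(ch %in% chars)   # running count of target characters
    idx <- which(hits >= n)[1]      # first position reaching n, NA if none
    if (is.na(idx)) NA_character_ else paste(ch[seq_len(idx)], collapse = "")
  }, character(1))
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_by_count(strings, n = 3)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_by_count(strings, n = 4)
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```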
