
Subset string by counting specific characters

I have the following strings:

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG") 

I want to cut off each string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:

some_function(strings)

c("ABBSDGN", "AABSDG", "AGN", "GGG") 

I tried using stringi, stringr and regular expressions, but I can't figure it out.

Nivel asked Dec 27 '18 19:12


2 Answers

You can accomplish your task with a simple call to str_extract from the stringr package:

library(stringr)

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

The [^AGN]*[AGN] portion of the regex pattern matches zero or more consecutive characters that are not A, G, or N, followed by exactly one A, G, or N. Wrapping it in parentheses and adding braces, as in ([^AGN]*[AGN]){3}, matches that pattern three times in a row. You can change the number of occurrences of A, G, or N you are looking for by changing the integer in the curly braces; a string containing fewer occurrences than requested yields NA:

str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
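If the character set and the count both need to vary, the pattern can be assembled from arguments. The following is only a sketch in base R: the function name cut_at_count and its arguments are hypothetical, and it assumes chars contains no characters that are special inside a bracket expression (such as ], ^ or -).

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# Hypothetical helper: keep each string up to the n-th occurrence of any
# character in `chars`; return NA when there are fewer than n occurrences.
# Assumes `chars` has no bracket-expression metacharacters (], ^, -).
cut_at_count <- function(x, chars = "AGN", n = 3) {
  pattern <- sprintf("([^%s]*[%s]){%d}", chars, chars, as.integer(n))
  m <- regexpr(pattern, x)
  hit <- !is.na(m) & m != -1          # elements where the pattern matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches() returns only the matches
  out
}

cut_at_count(strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_at_count(strings, n = 4)
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```

Filling a pre-allocated NA vector keeps the result the same length as the input, which mirrors what str_extract does.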

There are also a couple of ways to do this with base R functions. One is to use regexpr followed by regmatches:

m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Alternatively, you can use sub:

sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
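One thing to watch for: the two base R approaches behave differently from str_extract when a string has too few occurrences. The sketch below uses a count of 4 to illustrate; regmatches drops the non-matching element, while sub returns that string unchanged.

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
pattern <- "([^AGN]*[AGN]){4}"

# regmatches() returns only the elements that matched, so the result is
# shorter than the input ("AABSDGDRY" has just three of A/G/N).
m <- regexpr(pattern, strings)
via_regmatches <- regmatches(strings, m)
via_regmatches
# [1] "ABBSDGNHN"  "AGNA"       "GGGDSRTYHG"

# sub() leaves a non-matching string untouched instead of returning NA.
via_sub <- sub(paste0("(", pattern, ").*"), "\\1", strings)
via_sub
# [1] "ABBSDGNHN"  "AABSDGDRY"  "AGNA"       "GGGDSRTYHG"
```

If the output must line up with the input element by element, checking m != -1 before using either result is one way to handle this.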
Cameron Bieganek answered Oct 28 '22 09:10


Here is a base R option using strsplit:

sapply(strsplit(strings, ""), function(x)
    paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Or in the tidyverse:

library(tidyverse)
map_chr(str_split(strings, ""), 
    ~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
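A caveat for both variants above: which.max returns 1 when the condition is never TRUE, so a string with fewer than the requested number of occurrences is silently cut to its first character. A possible guard, sketched in base R with a hypothetical function name (cut_by_cumsum), uses match, which returns NA when the count is never reached:

```r
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

# The edge case: "AABSDGDRY" never reaches a count of 4, yet which.max()
# still reports position 1.
x <- strsplit("AABSDGDRY", "")[[1]]
which.max(cumsum(x %in% c("A", "G", "N")) == 4)
# [1] 1

# Hypothetical guarded version: NA when the n-th occurrence does not exist.
cut_by_cumsum <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    pos <- match(n, cumsum(ch %in% chars))  # first position where count == n
    if (is.na(pos)) NA_character_ else paste(ch[seq_len(pos)], collapse = "")
  }, character(1))
}

cut_by_cumsum(strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"
cut_by_cumsum(strings, n = 4)
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"
```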
Maurits Evers answered Oct 28 '22 10:10