Count the number of pattern matches in a string

Tags: string, regex, r

For example, I have a string

"AAAAAAACGAAAAAACGAAADGCGEDCG"

I want to count how many times "CG" occurs in it. How do I do that?

lgxqzz asked Jan 15 '14 20:01


3 Answers

You can use gregexpr to find the positions of "CG" in vec. When there is no match, gregexpr returns -1, so we exclude that value before counting. The function sum then counts the remaining matches.

> vec <- "AAAAAAACGAAAAAACGAAADGCGEDCG"
> sum(gregexpr("CG", vec)[[1]] != -1)
[1] 4
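To see why the comparison with -1 is needed, here is a minimal sketch of what gregexpr returns with and without a match. The variable names pos and none are only illustrative, and fixed = TRUE is an optional addition that treats "CG" as a literal string rather than a regular expression.

```r
vec <- "AAAAAAACGAAAAAACGAAADGCGEDCG"

# gregexpr returns a list with one element per input string;
# each element holds the start positions of the matches.
pos <- gregexpr("CG", vec, fixed = TRUE)[[1]]
sum(pos != -1)
# [1] 4

# Without a match the element is a single -1, so length() alone
# would wrongly report one match.
none <- gregexpr("CG", "AAAA", fixed = TRUE)[[1]]
as.integer(none)
# [1] -1
sum(none != -1)
# [1] 0
```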

If you have a vector of strings, you can use sapply:

> vec <- c("ACACACACA", "GGAGGAGGAG", "AACAACAACAAC", "GGCCCGCCGC", "TTTTGTT", "AGAGAGA")
> sapply(gregexpr("CG", vec), function(x) sum(x != -1))
[1] 0 0 0 2 0 0

If you have a list of strings, you can use unlist(vec) and then use the solution above.
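As a possible alternative in base R, the vector and list cases can be folded into one helper built on regmatches and lengths (lengths requires R 3.2.0 or later). This is only a sketch: count_matches is a hypothetical name, and fixed = TRUE assumes the pattern is a literal string.

```r
# count_matches is a hypothetical helper name, not part of base R.
count_matches <- function(x, pattern) {
  x <- unlist(x)  # accept a list of strings as well as a character vector
  # regmatches() returns character(0) for strings without a match,
  # so lengths() gives 0 there without any check against -1.
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

vec <- c("ACACACACA", "GGAGGAGGAG", "AACAACAACAAC", "GGCCCGCCGC", "TTTTGTT", "AGAGAGA")
count_matches(vec, "CG")
# [1] 0 0 0 2 0 0
count_matches(as.list(vec), "CG")
# [1] 0 0 0 2 0 0
```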

Sven Hohenstein answered Sep 20 '22 07:09


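The Bioconductor package Biostrings has a matchPattern function, and a companion countPattern that returns the number of matches directly:

countCG <- countPattern("CG", DNAString_object)

Note that DNAString_object is a single sequence (a DNAString or AAString), for example one element of the set that the Biostrings functions readDNAStringSet or readAAStringSet return when reading a FASTA file. To count over a whole set at once, use vcountPattern.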

JeremyS answered Sep 21 '22 07:09


Use str_count from stringr. It's simple to remember and read, though not a base function.

library(stringr)
str_count("AAAAAAACGAAAAAACGAAADGCGEDCG", "CG")
# [1] 4
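
Two details of str_count are worth a short sketch: it is vectorized over its input, and it interprets the pattern as a regular expression unless it is wrapped in fixed(). The example strings below are illustrative and assume the stringr package is installed.

```r
library(stringr)

# str_count is vectorized, so a character vector needs no sapply
str_count(c("ACACACACA", "GGCCCGCCGC"), "CG")
# [1] 0 2

# The pattern is a regular expression by default: "." matches any character
str_count("a.b.c", ".")
# [1] 5

# Wrap the pattern in fixed() to count a literal string instead
str_count("a.b.c", fixed("."))
# [1] 2
```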
Hugh answered Sep 21 '22 07:09