
Finding the longest stretch of repeated words in a long string of characters

I have a text file containing a long DNA sequence (characters A, T, C, G). I am looking for a method in R to find the longest stretch of a single repeated character. Let's say my string looks like this: AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA

I need the longest stretch as output, ideally with its count, e.g. AAAAAAAAAAAAAAAA n=16

Please help me with this.

Muhammad asked Sep 16 '25 22:09



1 Answer

If you have one string:

library(tidyverse)
string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"

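# "(.)\\1+" matches any character followed by one or more copies of itself.
# str_extract_all() returns a list, so take the first element to get a character vector.
x <- str_extract_all(string, "(.)\\1+")[[1]]
# Keep the longest run (the first one if there is a tie).
x[which.max(nchar(x))]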

[1] "AAAAAAAAAAAAAAAA"

If you have many strings:

str_extract_all(c(string, string), "(.)\\1+") %>%
  map_chr(~.x[which.max(nchar(.x))])

[1] "AAAAAAAAAAAAAAAA" "AAAAAAAAAAAAAAAA"

To get the counts, call nchar() on the result (stringr's str_length(), or str_count() with its default pattern, gives the same number).
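
For getting the run and its count in one step, here is a minimal base R sketch that uses run-length encoding (rle) instead of the regex approach above. The function name longest_run is hypothetical, and it assumes a single non-empty string as input; on ties it returns the first longest run.

```r
# Hypothetical helper (not from stringr): longest run of one repeated
# character in a single string, together with its length.
longest_run <- function(s) {
  chars <- strsplit(s, "")[[1]]      # split the string into single characters
  runs  <- rle(chars)                # run-length encode consecutive repeats
  i     <- which.max(runs$lengths)   # index of the first longest run
  list(run = strrep(runs$values[i], runs$lengths[i]),
       n   = runs$lengths[i])
}

string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
res <- longest_run(string)

# Print in the "RUN n=COUNT" form the question asks for.
cat(res$run, " n=", res$n, "\n", sep = "")
```

Unlike the regex, rle also reports runs of length 1, so this still returns a result when no character is repeated.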

KU99 answered Sep 18 '25 13:09
