
Extract a fixed-length character in R

Tags: split, r

I have an attribute consisting of DNA sequences and would like to translate each sequence into its amino acid names. To do that, I need to split each sequence into fixed-length chunks of 3 characters. Here is a sample of the data:

data=c("AATAGACGT","TGACCC","AAATCACTCTTT")

How can I split it into:

[1] "AAT" "AGA" "CGT"
[2] "TGA" "CCC" 
[3] "AAA" "TCA" "CTC" "TTT"

So far I have only found how to split a string using a regex as the separator.

Rochana Nana asked Dec 14 '22 15:12


2 Answers

Try

strsplit(data, '(?<=.{3})', perl=TRUE)

Or

library(stringi)
stri_extract_all_regex(data, '.{1,3}')
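
If stringi is not available, the same extract-by-pattern idea can probably be sketched in base R with gregexpr() and regmatches(). This is only an illustration and not part of the original answer; the variable name codons is made up here. The pattern .{1,3} matches up to three characters at a time, so a trailing chunk shorter than three characters would still be returned.

```r
data <- c("AATAGACGT", "TGACCC", "AAATCACTCTTT")

# gregexpr() locates every non-overlapping match of up to 3 characters;
# regmatches() extracts those matches, giving one character vector per input string.
codons <- regmatches(data, gregexpr(".{1,3}", data))  # 'codons' is an illustrative name
print(codons)
```

The result is a list with one element per input string, the same shape as the output requested in the question.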
akrun answered Jan 01 '23 19:01


Another solution, still a one-liner but less elegant than the others, using lapply and substring:

lapply(data, function(u) substring(u, seq(1, nchar(u), 3), seq(3, nchar(u),3)))
#[[1]]
#[1] "AAT" "AGA" "CGT"

#[[2]]
#[1] "TGA" "CCC"

#[[3]]
#[1] "AAA" "TCA" "CTC" "TTT"
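
The one-liner above relies on each sequence length being a multiple of 3; otherwise the start and stop vectors have different lengths and the last chunk may not come out as intended. A minimal sketch of a more tolerant variant follows, assuming non-empty input strings. The helper name split_fixed and its width parameter are hypothetical and not from the original answer.

```r
# Hypothetical helper (not from the original answer): split one non-empty string
# into chunks of `width` characters, keeping a shorter final chunk if one remains.
split_fixed <- function(u, width = 3) {
  starts <- seq(1, nchar(u), by = width)
  # substring() clips stop positions that run past the end of the string.
  substring(u, starts, starts + width - 1)
}

data <- c("AATAGACGT", "TGACCC", "AAATCACTCTTT")
chunks <- lapply(data, split_fixed)

# A sequence whose length is not a multiple of 3 keeps its leftover character.
partial <- split_fixed("AATAGAC")
print(chunks)
print(partial)
```

Computing the stop positions from the start positions keeps both vectors the same length, so no recycling takes place inside substring().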
Colonel Beauvel answered Jan 01 '23 19:01