I need to find positions in a string in which some blocks of characters should each count as a single position. I am working with nucleotide sequences and need to keep track of positions within the sequence, but some positions are variants written as [A/T], where either an A or a T could be present depending on which sequence I care about (these are two similar DNA sequences that differ at a few positions). Each of these variant sites therefore makes the string four characters longer than the actual sequence.
I know I could get around this by defining a new code in which [A/T] is converted to, say, X and [T/A] to Y, but that would get confusing because there is already a standard degeneracy code, and the standard code does not keep track of which nucleotide comes from which strain (for me, the one before the / is from strain A and the one after the / is from strain B). I want to index this DNA sequence string somehow, along these lines:
If I have a string like:
dna <- "ATC[A/T]G[G/C]ATTACAATCG"
I would like to get a table/data.frame:
pos nuc
1 A
2 T
3 C
4 [A/T]
5 G
6 [G/C]
... and so on
I feel like I could use strsplit somehow if I knew regex better. Can I add a condition so that it splits at every character, except that anything enclosed in square brackets is kept together as a block?
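library(stringr)
# Replace each bracketed variant with a one-character placeholder, then split into single characters
df <- as.data.frame(strsplit(gsub("\\[./.\\]", "_", dna), ""), stringsAsFactors = FALSE)
names(df) <- "nuc"
# Put the original variant blocks back in place of the placeholders, in order
df$nuc[df$nuc == "_"] <- str_extract_all(dna, "\\[./.\\]")[[1]]
df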
# nuc
# 1 A
# 2 T
# 3 C
# 4 [A/T]
# 5 G
# 6 [G/C]
# 7 A
# 8 T
# 9 T
# 10 A
# 11 C
# 12 A
# 13 A
# 14 T
# 15 C
# 16 G
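As a possible alternative that avoids the placeholder step, the tokens can be pulled out directly with a single regex that matches either a bracketed variant or any one character. This is only a sketch using base R's regmatches and gregexpr; it assumes every variant has the exact form [X/Y] with one character on each side of the slash, and the column names pos, strainA and strainB are names I made up for illustration.

```r
dna <- "ATC[A/T]G[G/C]ATTACAATCG"

# Match a bracketed variant block such as [A/T], or otherwise any single character
tokens <- regmatches(dna, gregexpr("\\[./.\\]|.", dna))[[1]]

df <- data.frame(pos = seq_along(tokens), nuc = tokens, stringsAsFactors = FALSE)

# Hypothetical extra columns: the nucleotide before the "/" (strain A) and after it (strain B).
# Tokens that are not variant blocks do not match the pattern and are left unchanged by sub().
df$strainA <- sub("\\[(.)/(.)\\]", "\\1", df$nuc)
df$strainB <- sub("\\[(.)/(.)\\]", "\\2", df$nuc)

df
```

With this layout the row number is the real sequence position, and each strain's sequence can be rebuilt with paste(df$strainA, collapse = "") or paste(df$strainB, collapse = "").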