I have a question about extracting part of a string. For example, I have a string like this:
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
I need to extract everything between GN= and the next ;. Here that would be NOC2L.
Is that possible?
Note: This is the INFO column from the VCF file format. GN is the gene name, so we want to extract the gene name from the INFO column.
The str_sub() function in stringr extracts parts of strings based on their location. As with all stringr functions, the first argument, string, is a vector of strings. The arguments start and end give the boundaries of the piece to extract, measured in characters.
The substr() function in base R extracts part of a string. It takes a start position and a stop position, both counted in characters, and returns the characters in that range. It does not change the original string. Base R's substr() does not treat negative positions as counting from the end; to take characters from the end of the string, compute the start position from nchar().
The substring() function in R extracts substrings from a character vector. It works like substr(), but its last argument is optional and defaults to the end of the string.
To get the first n characters from a string, we can use the built-in substr() function in R. It takes 3 arguments: the string, the start position, and the stop position. Note: negative positions do not count backward from the last character in substr(); that behavior belongs to stringr's str_sub().
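A minimal sketch of position-based extraction with base R, using the string from the question. The variable names (first5, last5, gn_start, rest, semi, gene) are made up for illustration, and the last part assumes that "GN=" actually occurs in the string; if it does not, regexpr() returns -1 and the result is meaningless.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# First 5 characters: positions 1 through 5.
first5 <- substr(a, 1, 5)

# Last 5 characters: derive the start position from the string length.
n <- nchar(a)
last5 <- substr(a, n - 4, n)

# Cut out the GN value by position (assumes "GN=" is present).
gn_start <- as.integer(regexpr("GN=", a, fixed = TRUE)) + 3L  # first character after "GN="
rest <- substr(a, gn_start, n)
semi <- as.integer(regexpr(";", rest, fixed = TRUE))          # -1 if no ";" follows
gene <- if (semi > 0) substr(rest, 1, semi - 1) else rest
```

For this particular task a regular expression is shorter, but the positional version shows how substr() and nchar() fit together.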
Try this:
sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"
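One caveat with sub(): when the pattern does not match (for example, GN is the last field and has no trailing semicolon, or there is no GN field at all), it returns the whole input string unchanged. A possible alternative, sketched here with base R's regexpr() and regmatches(), returns only the matched text and an empty result when nothing matches. It assumes no other key ends in "GN" (such as a hypothetical "XGN="), since the lookbehind would match that too.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# perl = TRUE enables the lookbehind (?<=GN=).
# [^;]+ runs up to the next ";" or to the end of the string.
m <- regexpr("(?<=GN=)[^;]+", a, perl = TRUE)
gene <- regmatches(a, m)
gene
# [1] "NOC2L"

# A string without a GN field yields character(0) instead of the full input.
b <- "DP=26;AN=2"
no_gene <- regmatches(b, regexpr("(?<=GN=)[^;]+", b, perl = TRUE))
```

Note that on a vector of INFO strings, regmatches() drops the elements that do not match, so the result can be shorter than the input.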
Assuming semicolons separate your elements, and equals signs occur exclusively between key/value pairs, a non-strictly-regex method would be:
bits <- unlist(strsplit(a, ';'))
do.call(rbind, strsplit(bits, '='))
      [,1] [,2]               
 [1,] "DP" "26"               
 [2,] "AN" "2"                
 [3,] "DB" "1"                
 [4,] "AC" "1"                
 [5,] "MQ" "56"               
 [6,] "MZ" "0"                
 [7,] "ST" "5:10,7:2"         
 [8,] "CQ" "SYNONYMOUS_CODING"
 [9,] "GN" "NOC2L"            
[10,] "PA" "1^1:0.720&2^1:0"  
Then it's just a matter of selecting the appropriate element.
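Selecting the element can be sketched as follows, by turning the two-column matrix into a named character vector and looking up the key. The names kv and info are made up for illustration. This assumes every field has exactly one "="; VCF INFO flags without a value would break the rbind() step.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Split into "key=value" fields, then into a two-column key/value matrix.
bits <- unlist(strsplit(a, ";", fixed = TRUE))
kv <- do.call(rbind, strsplit(bits, "=", fixed = TRUE))

# Named vector: values indexed by their keys.
info <- setNames(kv[, 2], kv[, 1])

info[["GN"]]
# [1] "NOC2L"
```

The same lookup works for any other key, for example info[["CQ"]].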
One way would be:
gsub(".*GN=(\\w+);.*", "\\1", a, perl = TRUE)
I am sure there are more elegant ways to do it.