
R extract part of string

I have a question about extracting part of a string. For example, I have a string like this:

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

I need to extract everything between GN= and the next semicolon (;). So here it would be NOC2L.

Is that possible?

Note: This is the INFO column from the VCF file format. GN is the gene name, so we want to extract the gene name from the INFO column.

Lisann asked Mar 15 '12 13:03



3 Answers

Try this:

sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"
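
As a hedged aside to the one-liner above: sub() returns its input unchanged when the pattern does not match, so a record without GN= (or with GN as the last field and no trailing semicolon) would come back as the whole INFO string. The sketch below is one possible way to handle a whole column; extract_info is a hypothetical helper name, and it assumes keys are plain alphanumeric tags with no regex metacharacters.

```r
# Hypothetical helper (not from the original answer): return the value of one
# INFO key for each element of a character vector, or NA when the key is absent.
extract_info <- function(info, key) {
  # Match the key only at the start of the string or right after a semicolon,
  # then capture everything up to the next semicolon (or the end of the string).
  pattern <- paste0("(^|;)", key, "=([^;]*)")
  matches <- regmatches(info, regexec(pattern, info))
  # Each match is c(full match, group 1, group 2); no match gives character(0).
  vapply(matches,
         function(m) if (length(m) >= 3) m[3] else NA_character_,
         character(1))
}

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
extract_info(a, "GN")
# [1] "NOC2L"
extract_info("DP=26;AN=2", "GN")
# [1] NA
```

Because the function is vectorised over info, it could be applied directly to an INFO column of a data frame.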
kohske answered Oct 22 '22 15:10


Assuming semicolons separate your elements, and equals signs occur exclusively between key/value pairs, a non-strictly-regex method would be:

bits <- unlist(strsplit(a, ';'))
do.call(rbind, strsplit(bits, '='))

      [,1] [,2]               
 [1,] "DP" "26"               
 [2,] "AN" "2"                
 [3,] "DB" "1"                
 [4,] "AC" "1"                
 [5,] "MQ" "56"               
 [6,] "MZ" "0"                
 [7,] "ST" "5:10,7:2"         
 [8,] "CQ" "SYNONYMOUS_CODING"
 [9,] "GN" "NOC2L"            
[10,] "PA" "1^1:0.720&2^1:0"  

Then it's just a matter of selecting the appropriate element.
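
The selection step might look like the following minimal sketch. The names kv and info are illustrative, and it keeps the same assumption as above: every field is a key=value pair (a bare flag with no equals sign would not fit the two-column matrix).

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Split into fields, then into a two-column key/value matrix.
bits <- unlist(strsplit(a, ";", fixed = TRUE))
kv <- do.call(rbind, strsplit(bits, "=", fixed = TRUE))

# Turn the matrix into a named character vector so values can be looked up by key.
info <- setNames(kv[, 2], kv[, 1])
info[["GN"]]
# [1] "NOC2L"
```

Indexing the matrix directly, as in kv[kv[, 1] == "GN", 2], would give the same value without building the named vector.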

jbaums answered Oct 22 '22 13:10


One way would be:

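# Anchor on "GN=" so the capture is the gene name, not whichever key=value pair happens to match last
sub(".*GN=(\\w+);.*", "\\1", a, perl = TRUE)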

I am sure there are more elegant ways to do it.

johannes answered Oct 22 '22 13:10