How to very efficiently extract specific pattern from characters?

Tags: regex, r

I have a large data set like this:

> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5        
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5   
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5      
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5     
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5        
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5

In every row, I want to extract the name after mature= and the name after Gene=, then paste them together with

paste(a,b, sep="-")

For example, the expected output for the first two rows would be:

hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9

My current implementation is:

for(i in 1:nrow(Data)){
    Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
    Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
    Data[i,4] <- as.numeric(sub("pvalue=","",Name))
    print(i)
}

This works, but it is very slow: Data has 200,000,000 rows. How can I speed it up?

asked Jan 06 '15 13:01 by Robin


1 Answer

If you can guarantee that the format is exactly as you specified, then a regular expression can capture (using the parentheses below) everything from the equals sign after mature up to the pipe symbol, and everything from Gene= to the end of the string, and join the two with a hyphen. Because sub() is vectorized, it can be applied to the whole column in one call instead of row by row:

sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])
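
The loop in the question also fills a fourth column from the second pipe-separated field of column 2. That step can probably be vectorized the same way. The following is only a sketch on a toy data frame: the column names V1, V2, name and pvalue are made up for illustration, and the layout of column 2 ("something|pvalue=...") is an assumption inferred from the loop, since the question does not show it.

```r
# Toy stand-in for Data. The contents of V2 are assumed, not taken
# from the question, which only shows column 1.
Data <- data.frame(
  V1 = c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5",
         "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),
  V2 = c("site=1|pvalue=0.01", "site=2|pvalue=0.5"),
  stringsAsFactors = FALSE
)

# One vectorized call over the whole column instead of one call per row.
Data$name <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data$V1)

# Capture what follows "pvalue=" in the second pipe-separated field.
# This stands in for strsplit(...)[[1]][2] plus sub("pvalue=", "", ...)
# from the loop, under the assumed layout of V2.
Data$pvalue <- as.numeric(sub("^[^|]*\\|pvalue=([^|]*).*$", "\\1", Data$V2))

print(Data[, c("name", "pvalue")])
```

Dropping the per-row assignment and the print(i) call should remove most of the overhead, although with 200,000,000 rows memory may still be the limiting factor.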
answered Sep 27 '22 19:09 by Gavin Kelly