I have big data like this :
> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
what I want to do is that, in every row, I want to select the name after word mature= and also the word after Gene= and then pater them together with
paste(a,b, sep="-")
for example, the expected output from first two rows would be like :
hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9
so, the final implementation is like this:
for(i in 1:nrow(Data)){
Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
Data[i,4] <- as.numeric(sub("pvalue=","",Name))
print(i)
}
which work well, but it's very slow. the size of Data is very big and it has 200,000,000 rows. this implementation is very slow for that. how can I speed it up ?
If you can guarantee that the format is exactly as you specified, then a regular expression can capture (denoted by the brackets below) everything from the equals sign upto the pipe symbol, and from the Gene= to the end, and paste them together with a minus sign:
sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With