
How to extract the middle part of a string in a data frame in R?

I have a proteomic data frame with several columns. One of them is called Description, which holds the protein name, OS, gene name (GN), PE, and SV, as shown below.

> head(pccmit$Description)
[1] "Protein NDRG4 OS=Homo sapiens GN=NDRG4 PE=1 SV=2"                                   
[2] "V-type proton ATPase subunit B_ brain isoform OS=Homo sapiens GN=ATP6V1B2 PE=1 SV=3"
[3] "Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3"                                    
[4] "Glutaminase kidney isoform_ mitochondrial OS=Homo sapiens GN=GLS PE=1 SV=1"         
[5] "Adenylate kinase isoenzyme 1 OS=Homo sapiens GN=AK1 PE=1 SV=3"                      
[6] "Sideroflexin-1 OS=Homo sapiens GN=SFXN1 PE=1 SV=4"

I'd like to extract just the gene name of each protein.

I've tried str_extract, but it is not working, maybe because I'm not using the pattern the function requires:

str_extract(A$Description, start = "GN=", end = " PE")

I would like to have a data frame with these gene names:

> head(pccmit$Description)
[1] NDRG4
[2] ATP6V1B2
[3] TF

Thank you all, guys

Guilherme Reis asked Dec 10 '22 02:12


2 Answers

Using the stringr package:

library(stringr)
str_extract(pccmit$Description, "(?<=GN=).*(?= PE)")

(?<=GN=) is a lookbehind that requires GN= immediately before the match, and (?= PE) is a lookahead that requires " PE" immediately after it, with .* matching everything in between.
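To get the result into the data frame itself, as the question asks, the same lookbehind idea can also be written in base R with perl = TRUE. This is only a minimal sketch, not part of the answer above: the toy data frame df and the column name Gene are assumptions, the third description is made up to show a missing GN= field, and the match is narrowed to \\S+ on the assumption that gene names contain no whitespace.

```r
# Toy stand-in for pccmit (hypothetical data; the last row has no GN= field)
df <- data.frame(
  Description = c(
    "Protein NDRG4 OS=Homo sapiens GN=NDRG4 PE=1 SV=2",
    "Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3",
    "Uncharacterized protein OS=Homo sapiens PE=4 SV=1"
  ),
  stringsAsFactors = FALSE
)

# Lookbehind is only supported by the PCRE engine, hence perl = TRUE
m <- regexpr("(?<=GN=)\\S+", df$Description, perl = TRUE)

# regmatches() drops non-matching elements, so fill the matches into an NA column
df$Gene <- NA_character_
df$Gene[m != -1] <- regmatches(df$Description, m)

df$Gene
```

Rows without a GN= field end up as NA in the new column instead of being dropped.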

sumshyftw answered Jan 25 '23 23:01


Here are some alternatives. No packages are used except for (5).

1) sub Using Lines shown in the Note at the end, and assuming that the gene name does not include any whitespace, the pattern matches everything up to GN=, captures the non-whitespace that follows, and then matches the rest of the string. The whole string is replaced with the captured portion, i.e. the non-whitespace following GN=.

sub(".*GN=(\\S+).*", "\\1", Lines)
## [1] "NDRG4"    "ATP6V1B2" "TF"       "GLS"      "AK1"      "SFXN1"   
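One thing to watch with sub is that a string without GN= is returned unchanged rather than flagged. Here is a small sketch of a guard; the helper name extract_gn is hypothetical and the second input line is made up for illustration.

```r
# Hypothetical helper: gene name after "GN=", or NA when the field is absent
extract_gn <- function(x) {
  has_gn <- grepl("GN=\\S", x)
  out <- sub(".*GN=(\\S+).*", "\\1", x)
  out[!has_gn] <- NA_character_
  out
}

x <- c("Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3",
       "Uncharacterized protein OS=Homo sapiens PE=4 SV=1")  # made-up line, no GN=

# Plain sub() hands the second string back untouched
unchanged <- sub(".*GN=(\\S+).*", "\\1", x)[2]
identical(unchanged, x[2])

# The guarded version marks it as missing instead
extract_gn(x)
```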

2) gsub Another approach is to remove everything up to and including GN= and then everything from the subsequent whitespace onwards:

gsub(".*GN=|\\s.*", "", Lines)
## [1] "NDRG4"    "ATP6V1B2" "TF"       "GLS"      "AK1"      "SFXN1"   
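The alternation may be easier to follow when split into its two deletions. A sketch using one of the sample lines; the variable names are illustrative only.

```r
x <- "Protein NDRG4 OS=Homo sapiens GN=NDRG4 PE=1 SV=2"

# First alternative: drop everything up to and including "GN="
after_gn <- sub(".*GN=", "", x)
after_gn
## [1] "NDRG4 PE=1 SV=2"

# Second alternative: drop from the first whitespace to the end
gene <- sub("\\s.*", "", after_gn)
gene
## [1] "NDRG4"
```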

3) read.dcf Another alternative is to convert the data into DCF format and then read it in using read.dcf. This will parse all fields and derive their names from the data itself producing matrix m.

g <- paste0("\nX:", gsub("(\\S+)=", "\n\\1:", Lines))

m <- read.dcf(textConnection(g))
m
##      X                                               OS             GN         PE  SV 
## [1,] "Protein NDRG4"                                 "Homo sapiens" "NDRG4"    "1" "2"
## [2,] "V-type proton ATPase subunit B_ brain isoform" "Homo sapiens" "ATP6V1B2" "1" "3"
## [3,] "Serotransferrin"                               "Homo sapiens" "TF"       "1" "3"
## [4,] "Glutaminase kidney isoform_ mitochondrial"     "Homo sapiens" "GLS"      "1" "1"
## [5,] "Adenylate kinase isoenzyme 1"                  "Homo sapiens" "AK1"      "1" "3"
## [6,] "Sideroflexin-1"                                "Homo sapiens" "SFXN1"    "1" "4"

m[, "GN"]
## [1] "NDRG4"    "ATP6V1B2" "TF"       "GLS"      "AK1"      "SFXN1"   
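To see why this works, it can help to look at the intermediate text for a single line: each KEY= becomes the start of a new KEY: line, which is the tag: value layout that read.dcf expects, and the leading newline leaves a blank line between records. A sketch using one sample line:

```r
x <- "Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3"

# Same transformation as above, applied to one description
g <- paste0("\nX:", gsub("(\\S+)=", "\n\\1:", x))

# Show the DCF text: a blank line, then one "tag:value" line per field
writeLines(g)

# read.dcf turns each tag into a column of a character matrix
con <- textConnection(g)
m <- read.dcf(con)
close(con)

colnames(m)
m[, "GN"]
```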

4) strcapture Another approach to parsing all fields is to use strcapture. This returns a data frame whereas read.dcf returns a matrix. This solution requires that we specify the fields whereas (3) derives them.

strcapture("(.*) OS=(.*) GN=(.*) PE=(.*) SV=(.*)", Lines,
  list(X = character(0), OS = character(0), GN = character(0), 
    PE = numeric(0), SV = numeric(0)))

giving this data.frame:

                                              X           OS       GN PE SV
1                                 Protein NDRG4 Homo sapiens    NDRG4  1  2
2 V-type proton ATPase subunit B_ brain isoform Homo sapiens ATP6V1B2  1  3
3                               Serotransferrin Homo sapiens       TF  1  3
4     Glutaminase kidney isoform_ mitochondrial Homo sapiens      GLS  1  1
5                  Adenylate kinase isoenzyme 1 Homo sapiens      AK1  1  3
6                                Sideroflexin-1 Homo sapiens    SFXN1  1  4

If DF is that data frame then DF$GN are the gene names.
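If only the gene name is needed, strcapture can also be used with a single capture group. This is a sketch under assumptions: the second line is made up and lacks GN=, to illustrate that, as far as I know, strcapture yields NA for lines the pattern does not match.

```r
x <- c("Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3",
       "Uncharacterized protein OS=Homo sapiens PE=4 SV=1")  # made-up line, no GN=

# One capture group, so the prototype has exactly one column
DF <- strcapture("GN=(\\S+)", x, data.frame(GN = character(0)))

DF$GN
```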

5) strapplyc Specify a pattern consisting of GN= followed by non-whitespace and put the latter in a capture group which is returned. This one has the simplest regular expression of any of the alternatives here.

library(gsubfn)
strapplyc(Lines, "GN=(\\S+)", simplify = TRUE)
## [1] "NDRG4"    "ATP6V1B2" "TF"       "GLS"      "AK1"      "SFXN1"   

Note

Lines <- c("Protein NDRG4 OS=Homo sapiens GN=NDRG4 PE=1 SV=2",
 "V-type proton ATPase subunit B_ brain isoform OS=Homo sapiens GN=ATP6V1B2 PE=1 SV=3",
 "Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3",
 "Glutaminase kidney isoform_ mitochondrial OS=Homo sapiens GN=GLS PE=1 SV=1",
 "Adenylate kinase isoenzyme 1 OS=Homo sapiens GN=AK1 PE=1 SV=3",        
 "Sideroflexin-1 OS=Homo sapiens GN=SFXN1 PE=1 SV=4")
G. Grothendieck answered Jan 25 '23 23:01