Compare split and merge two dataframe

Question

How to compare two datasets df1 and df2 by Gene names and extract the corresponding values for each gene names from df2 and insert it in to df1

df1 <-

Genes    sample.ID  chrom   loc.start   loc.end num.mark
Klri2     LO.WGS      1   3010000 173490000     8430
Rrs1      LO.WGS      1   3010000 173490000     8430
Serpin    LO.WGS      1   3010000 173490000     8430
Myoc      LO.WGS          1   3010000 173490000     8430
St18      LO.WGS          1   3010000 173490000     8430


df2 <-

    RL  pValue.   chr   start            end    CNA     Genes
    2   2.594433   1    129740006   129780779   gain    Klri2   
    2   3.941399   1    130080653   130380997   gain    Serpin,St18,Myoc

df3<-

Genes   sample.ID  chrom  loc.start  loc.end num.mark   RL  pValue      CNA
Klri2    LO.WGS     1   3010000   173490000     8430    2   2.594433    gain
Rrs1     LO.WGS     1   3010000   173490000     8430    0     0          0
Serpin   LO.WGS     1   3010000   173490000     8430    2   3.941399    gain
Myoc     LO.WGS     1   3010000   173490000     8430    2   3.941399    gain
St18     LO.WGS     1   3010000   173490000     8430    2   3.941399    gain

Steven Beaupré · Accepted Answer

You could try:

library(splitstackshape)   
out <- cSplit(df2, "Genes", sep = ",", "long")

This will reshape df2 in the proper format (one row for each gene):

#   RL  pValue. chr     start       end  CNA  Genes
#1:  2 2.594433   1 129740006 129780779 gain  Klri2
#2:  2 3.941399   1 130080653 130380997 gain Serpin
#3:  2 3.941399   1 130080653 130380997 gain   St18
#4:  2 3.941399   1 130080653 130380997 gain   Myoc

Then you simply have to use merge() or left_join() from dplyr:

library(dplyr)
df3 <- left_join(df1, out)

If you want to replace NAs by 0, you could do:

df3 <- left_join(df1, out) %>% mutate_each(funs(ifelse(is.na(.), 0, .)))

Or, if you prefer subsetting:

df3 <- left_join(df1, out) %>% (function(x) { x[is.na(x)] <- 0; x })

konvas · Answer

This is a merge operation but first you have to bring df2 in the right format, which will contain one row for each gene (rather than a single entry for multiple genes separated by comma). There is a convenient function to do this from the tidyr package, unnest()

df2 <- tidyr::unnest(
         transform(df2, Genes = strsplit(as.character(df2$Genes), ",")),
         Genes)

The result looks like this

df2
#  RL  pValue. chr     start       end  CNA  Genes
#1  2 2.594433   1 129740006 129780779 gain  Klri2
#2  2 3.941399   1 130080653 130380997 gain Serpin
#3  2 3.941399   1 130080653 130380997 gain   St18
#4  2 3.941399   1 130080653 130380997 gain   Myoc

Now you can simply use merge(df1, df2, all.x = TRUE) or left_join from dplyr (or other packages like data.table, depending on which one you want to learn). Note that this will introduce NAs where you want to have zeros but you can easily replace these.

Compare split and merge two dataframe

Tags:

r

Kryo

2 Answers

Steven Beaupré

konvas

Recent Activity

Donate For Us

Compare split and merge two dataframe

Tags:

r

Kryo

2 Answers

Steven Beaupré

konvas

Related questions

Recent Activity

Donate For Us