How to compare two datasets df1 and df2 by Gene names and extract the corresponding values for each gene names from df2 and insert it in to df1
df1 <-
Genes sample.ID chrom loc.start loc.end num.mark
Klri2 LO.WGS 1 3010000 173490000 8430
Rrs1 LO.WGS 1 3010000 173490000 8430
Serpin LO.WGS 1 3010000 173490000 8430
Myoc LO.WGS 1 3010000 173490000 8430
St18 LO.WGS 1 3010000 173490000 8430
df2 <-
RL pValue. chr start end CNA Genes
2 2.594433 1 129740006 129780779 gain Klri2
2 3.941399 1 130080653 130380997 gain Serpin,St18,Myoc
df3<-
Genes sample.ID chrom loc.start loc.end num.mark RL pValue CNA
Klri2 LO.WGS 1 3010000 173490000 8430 2 2.594433 gain
Rrs1 LO.WGS 1 3010000 173490000 8430 0 0 0
Serpin LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
Myoc LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
St18 LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
You could try:
library(splitstackshape)
out <- cSplit(df2, "Genes", sep = ",", "long")
This will reshape df2 in the proper format (one row for each gene):
# RL pValue. chr start end CNA Genes
#1: 2 2.594433 1 129740006 129780779 gain Klri2
#2: 2 3.941399 1 130080653 130380997 gain Serpin
#3: 2 3.941399 1 130080653 130380997 gain St18
#4: 2 3.941399 1 130080653 130380997 gain Myoc
Then you simply have to use merge() or left_join() from dplyr:
library(dplyr)
df3 <- left_join(df1, out)
If you want to replace NAs by 0, you could do:
df3 <- left_join(df1, out) %>% mutate_each(funs(ifelse(is.na(.), 0, .)))
Or, if you prefer subsetting:
df3 <- left_join(df1, out) %>% (function(x) { x[is.na(x)] <- 0; x })
This is a merge operation but first you have to bring df2 in the right format, which will contain one row for each gene (rather than a single entry for multiple genes separated by comma). There is a convenient function to do this from the tidyr package, unnest()
df2 <- tidyr::unnest(
transform(df2, Genes = strsplit(as.character(df2$Genes), ",")),
Genes)
The result looks like this
df2
# RL pValue. chr start end CNA Genes
#1 2 2.594433 1 129740006 129780779 gain Klri2
#2 2 3.941399 1 130080653 130380997 gain Serpin
#3 2 3.941399 1 130080653 130380997 gain St18
#4 2 3.941399 1 130080653 130380997 gain Myoc
Now you can simply use merge(df1, df2, all.x = TRUE) or left_join from dplyr (or other packages like data.table, depending on which one you want to learn). Note that this will introduce NAs where you want to have zeros but you can easily replace these.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With