I am working with microarray data.
I have two tables: a pathway/gene-set table (I will call it table A) and a microarray table (table B).
I need to replace the gene symbols (characters) in table A with expression values (numbers), using the expression value recorded for each gene symbol in table B.
The tables look like this:
A table
Pathway  v1  v2  ...  v249  v250
1        A   E   ...  NA    NA
2        B   A   ...  Z     I
3        C   G   ...  X     NA
4        D   K   ...  P     NA

B table
Gene  Value
E     1000
A     500
G     200
B     300
P     10
Z     20
I want table A to end up like this:
A table
Pathway  v1   v2    ...  v249  v250
1        500  1000  ...  NA    NA
2        300  500   ...  20    NA
3        NA   200   ...  NA    NA
4        NA   NA    ...  10    NA
Gene symbols that have no match in table B should be replaced with NA.
We can also do this using base R. We convert the subset of 'A' (i.e. every column except 'Pathway') to a matrix and match it against 'Gene' from 'B'. The numeric index obtained is used to pull the corresponding entries of the 'Value' column, and the output is assigned back to those columns.
A1 <- A
A1[-1] <- B$Value[match(as.matrix(A[-1]), B$Gene)]
A1
# Pathway v1 v2
#1 1 500 1000
#2 2 300 500
#3 3 NA 200
#4 4 NA NA
NOTE: Datasets from @DavidArenburg's post.
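To see how unmatched symbols behave, here is a minimal sketch of the same match-based lookup applied to data shaped like the tables in the question. The names `pathways` and `expr` are hypothetical stand-ins for A and B, only four gene columns are shown, and the columns are assumed to be stored as character rather than factor.

```r
# Hypothetical stand-ins for tables A and B from the question (character columns)
pathways <- data.frame(
  Pathway = 1:4,
  v1   = c("A", "B", "C", "D"),
  v2   = c("E", "A", "G", "K"),
  v249 = c(NA, "Z", "X", "P"),
  v250 = c(NA, "I", NA, NA),
  stringsAsFactors = FALSE
)
expr <- data.frame(
  Gene  = c("E", "A", "G", "B", "P", "Z"),
  Value = c(1000, 500, 200, 300, 10, 20),
  stringsAsFactors = FALSE
)

# as.matrix() flattens the gene columns column by column;
# match() returns the row position of each symbol in expr$Gene, or NA if absent
idx <- match(as.matrix(pathways[-1]), expr$Gene)

# Index the values by position and write them back into the gene columns
result <- pathways
result[-1] <- expr$Value[idx]
result
```

Symbols missing from `expr$Gene` (C, D, K, X, I) and cells that were already NA all give an NA index, so they come out as NA without any special handling.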
I would suggest first melting, then merging, then dcasting back. This will work for any number of columns in the A data set. I will be using the latest data.table version on CRAN for this (v 1.9.6+).
library(data.table) # V 1.9.6+
res <- melt(setDT(A), id = "Pathway")[setDT(B), Value := i.Value, on = c(value = "Gene")]
dcast(res, Pathway ~ variable, value.var = "Value")
# Pathway v1 v2
# 1: 1 500 1000
# 2: 2 300 500
# 3: 3 NA 200
# 4: 4 NA NA
Or similarly using the Hadleyverse (dplyr and tidyr):
library(dplyr)
library(tidyr)
A %>%
  gather(res, Gene, -Pathway) %>%
  left_join(., B, by = "Gene") %>%
  select(-Gene) %>%
  spread(res, Value)
# Pathway v1 v2
# 1 1 500 1000
# 2 2 300 500
# 3 3 NA 200
# 4 4 NA NA
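In newer tidyr releases (1.0.0 and later), `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The following is a sketch of the same reshape-join-reshape idea with those functions, not a replacement for the answer above. The data frames `pathways` and `expr` are hypothetical stand-ins for A and B, stored as character columns.

```r
library(dplyr)
library(tidyr)  # assumes tidyr >= 1.0.0 for pivot_longer()/pivot_wider()

# Hypothetical stand-ins for A and B (character columns)
pathways <- data.frame(
  Pathway = 1:4,
  v1 = c("A", "B", "C", "D"),
  v2 = c("E", "A", "G", "K"),
  stringsAsFactors = FALSE
)
expr <- data.frame(
  Gene  = c("E", "A", "G", "B"),
  Value = c(1000, 500, 200, 300),
  stringsAsFactors = FALSE
)

out <- pathways %>%
  # long format: one row per (Pathway, column name, gene symbol)
  pivot_longer(cols = -Pathway, names_to = "res", values_to = "Gene") %>%
  # attach the expression value; unmatched symbols get NA
  left_join(expr, by = "Gene") %>%
  select(-Gene) %>%
  # back to wide format, one column per original gene column
  pivot_wider(names_from = res, values_from = Value)
out
```

The result is a tibble with the same columns as the input, holding expression values in place of gene symbols.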
Data
A <- structure(list(Pathway = 1:4, v1 = structure(1:4, .Label = c("A",
"B", "C", "D"), class = "factor"), v2 = structure(c(2L, 1L, 3L,
4L), .Label = c("A", "E", "G", "K"), class = "factor")), .Names = c("Pathway",
"v1", "v2"), class = "data.frame", row.names = c(NA, -4L))
B <- structure(list(Gene = structure(c(3L, 1L, 4L, 2L), .Label = c("A",
"B", "E", "G"), class = "factor"), Value = c(1000L, 500L, 200L,
300L)), .Names = c("Gene", "Value"), class = "data.frame", row.names = c(NA,
-4L))