Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Concatenate several columns to comma separated strings by group

Tags:

r

Background: I am in the process of annotating SNPs from a GWAS in an organism without much annotation. I am using the chained tBLASTn table from UCSC along with biomaRt to map each SNP to a probable gene(s).

I have a dataframe that looks like this:

            SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5

What I would like to end up with is a single row for each SNP, and comma delimit the genes and hu_mRNAs. Here is what I am after:

            SNP            hu_mRNA    gene
 chr1.111642529 NM_002107,NM_005324   H3F3A
 chr1.111801684  BC098118,NM_020435   GJC2
  chr1.11801605  AK027740,NM_032849   C13orf33
 chr1.151220354 NM_018913,NM_018918   PCDHGA10,PCDHGA5

Now I know I can do this with a flick of the wrist in perl, but I really want to do this all in R. Any suggestions?

like image 386
caddymob Avatar asked Nov 10 '11 17:11

caddymob


3 Answers

You could do this in one line using plyr, as it is a classic split-apply-combine problem. You split using SNP, apply paste with collapse and assemble the pieces back into a data frame.

plyr::ddply(x, .(SNP), colwise(paste), collapse = ",")

If you want to do data reshaping in R at the flick of a wrist, learn plyr and reshape2 :). Another flick of the wrist solution using data.table, really useful if you are dealing with massive amounts of data.

data.table::data.table(x)[,lapply(.SD, paste, collapse = ","),'SNP']
like image 176
Ramnath Avatar answered Oct 03 '22 20:10

Ramnath


You can use aggregate with paste for each one and merge at the end:

x <- structure(list(SNP = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 
5L), .Label = c("chr1.111642529", "chr1.111801684", "chr1.111925084", 
"chr1.11801605", "chr1.151220354"), class = "factor"), hu_mRNA = structure(c(3L, 
4L, 2L, 7L, 1L, 8L, 5L, 6L), .Label = c("AK027740", "BC098118", 
"NM_002107", "NM_005324", "NM_018913", "NM_018918", "NM_020435", 
"NM_032849"), class = "factor"), gene = structure(c(4L, 5L, 1L, 
3L, 1L, 2L, 6L, 7L), .Label = c("<NA>", "C13orf33", "GJC2", "H3F3A", 
"H3F3B", "PCDHGA10", "PCDHGA5"), class = "factor")), .Names = c("SNP", 
"hu_mRNA", "gene"), class = "data.frame", row.names = c(NA, -8L
))

a1 <- aggregate(hu_mRNA~SNP,data=x,paste,sep=",")
a2 <- aggregate(gene~SNP,data=x,paste,sep=",")
merge(a1,a2)
             SNP              hu_mRNA              gene
1 chr1.111642529 NM_002107, NM_005324      H3F3A, H3F3B
2 chr1.111801684             BC098118              <NA>
3 chr1.111925084            NM_020435              GJC2
4  chr1.11801605  AK027740, NM_032849    <NA>, C13orf33
5 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5
like image 36
James Avatar answered Oct 03 '22 22:10

James


First set up the test data. Note that we have made the columns to be of "character" class rather than "factor" by using as.is=TRUE :

Lines <- "SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5"
cat(Lines, "\n", file = "data.txt")
DF <- read.table("data.txt", header = TRUE, na.strings = "<NA>", as.is = TRUE)

Now try this aggregate statement:

> aggregate(. ~ SNP, DF, toString)
             SNP              hu_mRNA              gene
1 chr1.111642529 NM_002107, NM_005324      H3F3A, H3F3B
2 chr1.111925084            NM_020435              GJC2
3  chr1.11801605            NM_032849          C13orf33
4 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5
like image 21
G. Grothendieck Avatar answered Oct 03 '22 21:10

G. Grothendieck