Perl (or R, or SQL): Count how often string appears across columns

Tags: string, mysql, r, perl

I have a text file that looks like this:

gene1   gene2   gene3
a       d       c
b       e       d
c       f       g
d       g       
        h
        i

(Each column is a human gene, and each column lists a variable number of proteins that can bind to that gene. The proteins are strings, shown here as letters.)

What I want to do is count how many columns each string appears in, then output that count along with the headers of those columns, like this:

a   1   gene1
b   1   gene1
c   2   gene1 gene3
d   3   gene1 gene2 gene3
e   1   gene2
f   1   gene2
g   2   gene2 gene3
h   1   gene2
i   1   gene2

I have been trying to figure out how to do this in Perl and R, but without success so far. Thanks for any help.

Stephen Turner asked Aug 04 '11 02:08


1 Answer

This solution seems like a bit of a hack, but it gives the desired output. It relies on both the plyr and reshape packages, though I'm sure you could find base R alternatives. The trick is that the melt function flattens the data into a long format (one row per column/value pair), which allows for easy(ish) manipulation from that point forward.

library(reshape)
library(plyr)

#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
                  gene2 = letters[4:9],
                  gene3 = c("c", "d", "g", NA, NA, NA)
                  )

#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)

#Tabulate counts
counts <- as.data.frame(table(dat.m$value))

#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))

#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")

Gives:

> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
  Var1 Freq                V1
1    a    1             gene1
2    b    1             gene1
3    c    2       gene1 gene3
4    d    3 gene1 gene2 gene3
....
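
For comparison, here is a minimal base R sketch of the same idea without plyr or reshape. It assumes the same layout as dat above (one column per gene, padded with NA), and the names long, by_string and result are only illustrative. It also assumes R 3.2 or later, since it uses lengths().

```r
# Recreate the example data as plain character columns
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
                  gene2 = letters[4:9],
                  gene3 = c("c", "d", "g", NA, NA, NA),
                  stringsAsFactors = FALSE)

# Flatten to long format by hand: one row per (string, column name) pair.
# unlist() walks the columns in order, so rep(..., each = nrow(dat)) lines up.
long <- data.frame(value = unlist(dat, use.names = FALSE),
                   gene  = rep(names(dat), each = nrow(dat)),
                   stringsAsFactors = FALSE)

# Drop the NA padding, and drop repeats of a string within one column
long <- unique(long[!is.na(long$value), ])

# Group the column names by string
by_string <- split(long$gene, long$value)

# One row per string: how many columns it appears in, and which ones
result <- data.frame(string = names(by_string),
                     count  = unname(lengths(by_string)),
                     genes  = vapply(by_string, paste, character(1),
                                     collapse = " ", USE.NAMES = FALSE),
                     stringsAsFactors = FALSE)

print(result)
```

With the example data, result has one row per string, for instance d with a count of 3 and "gene1 gene2 gene3". Because of the unique() call, a string listed twice under the same gene is counted once for that column, which may or may not be what you want.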
Chase answered Oct 07 '22 00:10