I have a text file that looks like this:
gene1 gene2 gene3
a d c
b e d
c f g
d g
h
i
(Each column is a human gene, and each contains a variable number of proteins (strings, shown as letters here) that can bind to those genes).
What I want to do is count how many columns each string is represented in, output that number and all the column headers, like this:
a 1 gene1
b 1 gene1
c 2 gene1 gene3
d 3 gene1 gene2 gene3
e 1 gene2
f 1 gene2
g 2 gene2 gene3
h 1 gene2
i 1 gene2
I have been trying to figure out how to do this in Perl and R, but without success so far. Thanks for any help.
This solution seems like a bit of a hack, but it gives the desired output. It relies on using both plyr
and reshape
packages, though I'm sure you could find base R alternatives. The trick is that function melt
lets us flatten the data out into a long format, which allows for easy(ish) manipulation from that point forward.
library(reshape)
library(plyr)
#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
gene2 = letters[4:9],
gene3 = c("c", "d", "g", NA, NA, NA)
)
#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)
#Tabulate counts
counts <- as.data.frame(table(dat.m$value))
#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))
#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Gives:
> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Var1 Freq V1
1 a 1 gene1
2 b 1 gene1
3 c 2 gene1 gene3
4 d 3 gene1 gene2 gene3
....
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With