I have a large dataset, 287046 x 18, that looks like this (only a partial representation):
tdf
geneSymbol peaks
16 AK056486 Pol2_only
13 AK310751 no_peak
7 BC036251 no_peak
10 DQ575786 no_peak
4 DQ597235 no_peak
5 DQ599768 no_peak
11 DQ599872 no_peak
12 DQ599872 no_peak
2 FAM138F no_peak
15 FAM41C no_peak
34116 GAPDH both
283034 GAPDH Pol2_only
6 LOC100132062 no_peak
9 LOC100133331 no_peak
14 LOC100288069 both
8 M37726 no_peak
3 OR4F5 no_peak
17 SAMD11 both
18 SAMD11 both
19 SAMD11 both
20 SAMD11 both
21 SAMD11 both
22 SAMD11 both
23 SAMD11 both
24 SAMD11 both
25 SAMD11 both
1 WASH7P Pol2_only
What I want to do is extract (1) the geneSymbols that are either "Pol2_only" or "both" and then; (2) just the geneSymbols that are "Pol2_only" but not "both". For example, GAPDH would fulfil condition 1 but not 2.
I've tried plyr with something like this (there is an extra condition there, please ignore):
## grab genes with both peaks
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE)
## grab genes pol2 only peaks
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE)
But it takes a long time and still returns the wrong answer. For instance, the answer for 2 is:
pol2.only.peaks
geneSymbol peaks
1 AK056486 Pol2_only
2 GAPDH Pol2_only
3 WASH7P Pol2_only
As you can see GAPDH should not be there. My implementation in data.table (which is much prefer and thus preferred) also yields the same result:
filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
geneSymbol peaks
1: AK056486 Pol2_only
2: GAPDH Pol2_only
3: WASH7P Pol2_only
The issue seems to be that the subsetting is working on a row-by-row basis whereas, I need it to be applied on the subgroup of geneSymbol as a whole.
Could please help me subset on the group? A data.table solution would be welcome because it is faster but plyr (or even base R) is fine. A solution that adds an extra column noting the nature of the peak would be perfect. This is what I mean:
tdf
geneSymbol peaks newCol
16 AK056486 Pol2_only Pol2_only
13 AK310751 no_peak no_peak
7 BC036251 no_peak no_peak
10 DQ575786 no_peak no_peak
4 DQ597235 no_peak no_peak
5 DQ599768 no_peak no_peak
11 DQ599872 no_peak no_peak
12 DQ599872 no_peak no_peak
2 FAM138F no_peak no_peak
15 FAM41C no_peak no_peak
34116 GAPDH both both
283034 GAPDH Pol2_only both
6 LOC100132062 no_peak no_peak
9 LOC100133331 no_peak no_peak
14 LOC100288069 both both
8 M37726 no_peak no_peak
3 OR4F5 no_peak no_peak
17 SAMD11 both both
18 SAMD11 both both
19 SAMD11 both both
20 SAMD11 both both
21 SAMD11 both both
22 SAMD11 both both
23 SAMD11 both both
24 SAMD11 both both
25 SAMD11 both both
1 WASH7P Pol2_only Pol2_only
Notice again the GAPDH that is now "both" in the 2 rows. Here is the data:
dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251",
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F",
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069",
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11",
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only",
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak",
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak",
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both",
"both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol",
"peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L,
2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L,
20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")
Thank you!
edit ** I've found a workaround for the problem. The selection was being done row-by-row. All it is needed is a hack, that is, that in the logical vector that is returned ALL values are true. So here is what I did with the plyr function:
ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE)
geneSymbol peaks
1 AK056486 Pol2_only
2 WASH7P Pol2_only
Note the use of all in alongside the conditions. Now the results is the expected, that is, "Pol2_only" only (redundancy alert) genes :) What is still left to be done is the implementation in data.table which I tried but failed to do. Any help?
I have not written an answer to my question in expectation that someone comes along with a better solution in data.table.
As you requested a data.table solution.
# set the key to be "peaks
TDF <- data.table(tdf, key = c('geneSymbol','peaks'))
# use unique to get unique combinations, then for each geneSymbol get the first
# match (we have keyed by peak soboth < Pol2_only < no_peak within each
# geneSymbol )
# then exclude those with "peak == "no_peak")
unique(TDF)[.(unique(geneSymbol)), mult = 'first'][!peaks =='no_peak']
# geneSymbol peaks
# 1: AK056486 Pol2_only
# 2: GAPDH both
# 3: LOC100288069 both
# 4: SAMD11 both
# 5: WASH7P Pol2_only
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With