Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split a data.table by groups and use subset by occourences in a columns?

I have a large dataset, 287046 x 18, that looks like this (only a partial representation):

tdf
         geneSymbol     peaks
16         AK056486 Pol2_only
13         AK310751   no_peak
7          BC036251   no_peak
10         DQ575786   no_peak
4          DQ597235   no_peak
5          DQ599768   no_peak
11         DQ599872   no_peak
12         DQ599872   no_peak
2           FAM138F   no_peak
15           FAM41C   no_peak
34116         GAPDH      both
283034        GAPDH Pol2_only
6      LOC100132062   no_peak
9      LOC100133331   no_peak
14     LOC100288069      both
8            M37726   no_peak
3             OR4F5   no_peak
17           SAMD11      both
18           SAMD11      both
19           SAMD11      both
20           SAMD11      both
21           SAMD11      both
22           SAMD11      both
23           SAMD11      both
24           SAMD11      both
25           SAMD11      both
1            WASH7P Pol2_only

What I want to do is extract (1) the geneSymbols that are either "Pol2_only" or "both" and then; (2) just the geneSymbols that are "Pol2_only" but not "both". For example, GAPDH would fulfil condition 1 but not 2.

I've tried plyr with something like this (there is an extra condition there, please ignore):

## grab genes with both peaks 
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE)

## grab genes pol2 only peaks 
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE)

But it takes a long time and still returns the wrong answer. For instance, the answer for 2 is:

pol2.only.peaks
  geneSymbol     peaks
1   AK056486 Pol2_only
2      GAPDH Pol2_only
3     WASH7P Pol2_only

As you can see GAPDH should not be there. My implementation in data.table (which is much prefer and thus preferred) also yields the same result:

filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
   geneSymbol     peaks
1:   AK056486 Pol2_only
2:      GAPDH Pol2_only
3:     WASH7P Pol2_only

The issue seems to be that the subsetting is working on a row-by-row basis whereas, I need it to be applied on the subgroup of geneSymbol as a whole.

Could please help me subset on the group? A data.table solution would be welcome because it is faster but plyr (or even base R) is fine. A solution that adds an extra column noting the nature of the peak would be perfect. This is what I mean:

tdf
         geneSymbol     peaks      newCol
16         AK056486 Pol2_only   Pol2_only
13         AK310751   no_peak     no_peak
7          BC036251   no_peak     no_peak
10         DQ575786   no_peak     no_peak
4          DQ597235   no_peak     no_peak
5          DQ599768   no_peak     no_peak
11         DQ599872   no_peak     no_peak
12         DQ599872   no_peak     no_peak
2           FAM138F   no_peak     no_peak
15           FAM41C   no_peak     no_peak
34116         GAPDH      both        both
283034        GAPDH Pol2_only        both
6      LOC100132062   no_peak     no_peak
9      LOC100133331   no_peak     no_peak
14     LOC100288069      both        both
8            M37726   no_peak     no_peak
3             OR4F5   no_peak     no_peak
17           SAMD11      both        both
18           SAMD11      both        both
19           SAMD11      both        both
20           SAMD11      both        both
21           SAMD11      both        both
22           SAMD11      both        both
23           SAMD11      both        both
24           SAMD11      both        both
25           SAMD11      both        both
1            WASH7P Pol2_only   Pol2_only

Notice again the GAPDH that is now "both" in the 2 rows. Here is the data:

dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251", 
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F", 
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069", 
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", 
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only", 
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", 
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak", 
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both", 
"both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol", 
"peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L, 
2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L, 
20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")

Thank you!

edit ** I've found a workaround for the problem. The selection was being done row-by-row. All it is needed is a hack, that is, that in the logical vector that is returned ALL values are true. So here is what I did with the plyr function:

ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE)
  geneSymbol     peaks
1   AK056486 Pol2_only
2     WASH7P Pol2_only

Note the use of all in alongside the conditions. Now the results is the expected, that is, "Pol2_only" only (redundancy alert) genes :) What is still left to be done is the implementation in data.table which I tried but failed to do. Any help?

I have not written an answer to my question in expectation that someone comes along with a better solution in data.table.

like image 394
fridaymeetssunday Avatar asked Oct 03 '22 09:10

fridaymeetssunday


1 Answers

As you requested a data.table solution.

# set the key to be "peaks
TDF <- data.table(tdf, key = c('geneSymbol','peaks'))

# use unique to get unique combinations, then for each geneSymbol get the first
# match (we have keyed by peak soboth < Pol2_only < no_peak within each 
# geneSymbol )
# then exclude those with "peak == "no_peak")

unique(TDF)[.(unique(geneSymbol)), mult = 'first'][!peaks =='no_peak']

#      geneSymbol     peaks
# 1:     AK056486 Pol2_only
# 2:        GAPDH      both
# 3: LOC100288069      both
# 4:       SAMD11      both
# 5:       WASH7P Pol2_only
like image 77
mnel Avatar answered Oct 07 '22 20:10

mnel