
R: For a list, generate all combinations of factor levels, all combinations of merges, and combine

So I am working with cancer stage data. Assume a dataset of this type. It's a dataframe.

   cancertype     stage
TCGA-67-6215-01     1
TCGA-67-6216-01     1
TCGA-67-6217-01     2
TCGA-69-7760-01     2
TCGA-69-7761-01     1
TCGA-69-7763-01     1
TCGA-69-7764-01     1
TCGA-69-7765-01     4
TCGA-69-7980-01     1
TCGA-71-6725-01     1
TCGA-73-4658-01     1
TCGA-73-4659-01     3
TCGA-73-4662-01     1
TCGA-73-4675-01     3
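For anyone who wants to experiment, the table above can be rebuilt roughly as follows. This is only a sketch: the object name `dat` and the choice of a plain numeric `stage` column are assumptions (the question describes the stages as factors), and the column names are taken from the header shown above.

```r
# Rebuild the example table; `dat` is an assumed name, not from the question.
dat <- data.frame(
  cancertype = c("TCGA-67-6215-01", "TCGA-67-6216-01", "TCGA-67-6217-01",
                 "TCGA-69-7760-01", "TCGA-69-7761-01", "TCGA-69-7763-01",
                 "TCGA-69-7764-01", "TCGA-69-7765-01", "TCGA-69-7980-01",
                 "TCGA-71-6725-01", "TCGA-73-4658-01", "TCGA-73-4659-01",
                 "TCGA-73-4662-01", "TCGA-73-4675-01"),
  stage = c(1, 1, 2, 2, 1, 1, 1, 4, 1, 1, 1, 3, 1, 3),  # assumed numeric here
  stringsAsFactors = FALSE
)

# Number of samples per stage
stage_counts <- table(dat$stage)
stage_counts
```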

So what I want is a list where each element is a dataframe. Here there are 4 levels for the 4 possible cancer stages. There should be a dataframe for each combo of 2 levels, of 3 levels, etc., up to the number of levels in the data. But there should also be a dataframe for each combination of merged levels. What I mean is:

list(
dataframe of stage1 and 2
dataframe of stage1 and 3
dataframe of stage 1 and 4
dataframe of stage 2 and 3
...etc
dataframe of stage 1,2 and 3
dataframe of stage 2,3 and 4
...
dataframe of stage 1,2 and 3,4
dataframe of stage 1,3 and 2,4
dataframe of stage 1,2,3 and 4
dataframe of stage 1,2,4 and 3
.. etc etc I think this should give you the idea.
)

Here, when I say stage 1,2,4, I mean they have all been merged into one level.

I am basically trying to do every possible t-test comparison, so I am setting up the samples I will need for these comparisons. It would be nice to just do every possible combo and merge combo.

Where I am so far: I am able to combine all the elements of the unmerged comparisons, of which there are 11, i.e. 6 combos of 2 stages, 4 combos of 3 stages, and 1 combo of 4 stages.
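The count of 11 can be double-checked with binomial coefficients. This is just a quick sanity check; the variable names are illustrative and not part of the original code.

```r
n_stages <- 4                        # number of stage levels in the example
sizes <- 2:n_stages                  # combos of 2, 3 and 4 stages
counts <- choose(n_stages, sizes)    # 6, 4 and 1 respectively
names(counts) <- sizes
counts
total <- sum(counts)                 # 11 unmerged combinations in total
total
```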

stage  # dataframe of stage data as factors
stage_split <- split(stage, stage[, 1])
allcombos <- c(combn(stage_split, 2, simplify = FALSE),
               combn(stage_split, 3, simplify = FALSE),
               combn(stage_split, 4, simplify = FALSE))
allcombos_cmbnd <- lapply(allcombos, function(x) Reduce(rbind, x))

How do I generate the additional dataframes from all the possible merge permutations, and then append them to this list? Maybe there is an elegant way to accomplish this from the first dataframe. One way would be to iterate through this list of 11 and generate merges starting at combos of 3? I could brute-force it, but I am hoping there is an elegant way to do this that could be scaled up. Nothing I have found so far explains how to generate all combinations of the levels in your data and all merge combinations of those levels.

Thanks for any help

SuperCal123 asked Oct 31 '22 22:10


1 Answer

When you group the stages together, you are partitioning sets of size 3 or 4. There is a package, partitions, that implements set partitioning with setparts. Here I focus on the merging part, since it sounds like you have already figured out the non-merged grouping.
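To make "set partitioning" concrete without installing anything, here is a minimal base-R sketch that enumerates the partitions of a small set. It is a stand-in for illustration, not how `setparts` is implemented; the function name `set_partitions` is hypothetical, and the labels and column order it produces may differ from those of `setparts`.

```r
# Enumerate all partitions of the set 1..n as label vectors.
# Element 1 always gets label 1; each later element either joins an
# existing block or opens the next new block (a "restricted growth string").
set_partitions <- function(n) {
  out <- list()
  extend <- function(labels) {
    if (length(labels) == n) {
      out[[length(out) + 1L]] <<- labels   # complete partition, store it
      return(invisible(NULL))
    }
    for (b in seq_len(max(labels) + 1L)) extend(c(labels, b))
  }
  extend(1L)
  do.call(cbind, out)   # one partition per column
}

p3 <- set_partitions(3)   # partitions of a 3-element set
p4 <- set_partitions(4)   # partitions of a 4-element set
ncol(p3)
ncol(p4)
```

Each column assigns a block label to every element, so elements sharing a label are merged. There are 5 such columns for a set of size 3 and 15 for a set of size 4, including the trivial "everything merged" and "nothing merged" cases.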

## `dat` is the data.frame from the question (columns cancertype and stage)
## For unmerged, get groupings with something like this
combos <- unlist(lapply(2:4, function(x) combn(unique(dat$stage), x, simplify = FALSE)),
                 recursive = FALSE)

## For merged groupings, use set partitioning
library(partitions)
dats <- unlist(sapply(3:4, function(p) {
    parts <- setparts(p)  # set partitions of size p
    lst <- lapply(split(parts, col(parts)), function(idx) {
        if (p==3) {       # with sets of 3, need to exclude one of the stages
            subLst <- lapply(1:4, function(exclude) {
                tmp <- dat$stage
                tmp[dat$stage==exclude] <- NA
                ids <- seq(4)[-exclude]
                for (i in 1:3) tmp[dat$stage==ids[i]] <- idx[i]
                data.frame(dat$cancertype, stage=tmp)
            })
            names(subLst) <- paste(1:4)
            subLst
        } else {          # sets of 4, no need to exclude
            tmp <- dat$stage
            for (i in 1:length(idx)) tmp[dat$stage==i] <- idx[i]
            data.frame(dat$cancertype, stage=tmp)
        }
    })
    names(lst) <- lapply(split(parts, col(parts)), paste, collapse=".")
    lst
}), recursive = FALSE)

dats is now a list of data.frames with the stages grouped by the set partitions. When partitioning sets of size 3, one of the stages had to be removed. So, those entries in dats appear as lists of length four, each element corresponding to removing one stage from consideration (the lists are ordered, so the first component removes stage 1, the second removes stage 2, etc.). Let's look at one of the size-3 partitions:

dats[4]
# $`2.1.1`
# $`2.1.1`$`1`
#     dat.cancertype stage
# 1  TCGA-67-6215-01    NA
# 2  TCGA-67-6216-01    NA
# 3  TCGA-67-6217-01     2
# 4  TCGA-69-7760-01     2
# 5  TCGA-69-7761-01    NA
# 6  TCGA-69-7763-01    NA
# 7  TCGA-69-7764-01    NA
# 8  TCGA-69-7765-01     1
# 9  TCGA-69-7980-01    NA
# 10 TCGA-71-6725-01    NA
# 11 TCGA-73-4658-01    NA
# 12 TCGA-73-4659-01     1
# 13 TCGA-73-4662-01    NA
# 14 TCGA-73-4675-01     1
# 
# $`2.1.1`$`2`
#     dat.cancertype stage
# 1  TCGA-67-6215-01     2
# 2  TCGA-67-6216-01     2
# 3  TCGA-67-6217-01    NA
# 4  TCGA-69-7760-01    NA
# 5  TCGA-69-7761-01     2
# 6  TCGA-69-7763-01     2
# 7  TCGA-69-7764-01     2
# 8  TCGA-69-7765-01     1
# 9  TCGA-69-7980-01     2
# 10 TCGA-71-6725-01     2
# 11 TCGA-73-4658-01     2
# 12 TCGA-73-4659-01     1
# 13 TCGA-73-4662-01     2
# 14 TCGA-73-4675-01     1

The naming convention here is group1.group2.group3$excludedGroup, and identical numbers mean the groups have been merged. So, 2.1.1$1 means the first stage has been excluded ($1; its rows are actually just set to NA) and, in the remaining data, groups 2 and 3 have been combined. It's a bit confusing, and a better naming scheme is probably needed. As an example, $2.1.1$1 means "stage 1 is excluded (NA) and stage 3 and stage 4 have been combined". So, I could access that data with dats[['2.1.1']][['1']]. There are two more data.frames in this list that aren't shown (those excluding stages 3 and 4).
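One possible way to get more readable names is to spell out which stages end up in each group. The helper below is a hypothetical sketch (the name `describe_grouping` and the "+" / " vs " separators are made up for illustration); it takes a label vector such as 2.1.1 and the stages those labels apply to.

```r
# Hypothetical helper: describe a grouping in terms of the original stages.
#   labels: one label per stage; equal labels mean "merged"
#   stages: the stages the labels refer to (excluded stage already dropped)
describe_grouping <- function(labels, stages) {
  groups <- split(stages, labels)                          # stages sharing a label
  groups <- groups[order(vapply(groups, min, numeric(1)))] # sort by lowest stage
  parts <- vapply(groups, paste, character(1), collapse = "+")
  paste(unname(parts), collapse = " vs ")
}

name_3 <- describe_grouping(c(2, 1, 1), stages = c(2, 3, 4))  # stage 1 excluded
name_4 <- describe_grouping(c(2, 3, 1, 1), stages = 1:4)
name_3   # "2 vs 3+4"
name_4   # "1 vs 2 vs 3+4"
```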

Now, the set-4 partitions are more straightforward since there were no exclusions. For example,

dats[19]
# $`2.3.1.1`
#     dat.cancertype stage
# 1  TCGA-67-6215-01     2
# 2  TCGA-67-6216-01     2
# 3  TCGA-67-6217-01     3
# 4  TCGA-69-7760-01     3
# 5  TCGA-69-7761-01     2
# 6  TCGA-69-7763-01     2
# 7  TCGA-69-7764-01     2
# 8  TCGA-69-7765-01     1
# 9  TCGA-69-7980-01     2
# 10 TCGA-71-6725-01     2
# 11 TCGA-73-4658-01     2
# 12 TCGA-73-4659-01     1
# 13 TCGA-73-4662-01     2
# 14 TCGA-73-4675-01     1

The naming here is "Group1.Group2.Group3.Group4". In this grouping, for example, stages 3 and 4 have been merged (both == 1).

There are redundancies here: you could either go with partitioning sets of size 3 with exclusion, or with partitioning sets of size 4 and doing multiple comparisons on each data.frame. For example, of the datasets shown above, equivalent tests can be done using dats[['2.3.1.1']] or both dats[['2.1.1']][['1']] and dats[['2.1.1']][['2']] combined.

To simplify things, instead of storing all these data.frames in a list you could just store the indices, or just do your calculations in the loop.
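As a rough sketch of that last idea, you can keep only a matrix of label vectors and relabel the samples on the fly inside the loop. The three groupings below are hand-picked for illustration (in practice the columns would come from the set partitions), and `table()` is only a placeholder for the real calculation, since the example data contains no measurement column to test.

```r
# Stage column from the question's example data
stage <- c(1, 1, 2, 2, 1, 1, 1, 4, 1, 1, 1, 3, 1, 3)

# One grouping per column; row i holds the label assigned to stage i.
# Illustrative subset only, not the full set of partitions.
groupings <- cbind("2.3.1.1" = c(2, 3, 1, 1),
                   "1.1.2.2" = c(1, 1, 2, 2),
                   "1.2.1.2" = c(1, 2, 1, 2))

group_sizes <- lapply(seq_len(ncol(groupings)), function(j) {
  merged <- groupings[stage, j]   # look up each sample's merged label
  table(merged)                   # placeholder for the per-grouping test
})
names(group_sizes) <- colnames(groupings)
group_sizes[["2.3.1.1"]]
```

Indexing a label vector by the stage column does the relabeling in one step, so no intermediate data.frames need to be stored.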

Rorschach answered Nov 09 '22 14:11