Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count every possible pair of values in a column grouped by multiple columns

Tags:

r

I have a dataframe that looks like this (this is just a subset, actually dataset has 2724098 rows)

> head(dat)

chr   start  end    enhancer motif 
chr10 238000 238600 9_EnhA1  GATA6 
chr10 238000 238600 9_EnhA1  GATA4 
chr10 238000 238600 9_EnhA1    SRF 
chr10 238000 238600 9_EnhA1  MEF2A 
chr10 375200 375400 9_EnhA1  GATA6 
chr10 375200 375400 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1  GATA6 
chr10 440400 441000 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1    SRF 
chr10 440400 441000 9_EnhA1  MEF2A 
chr10 441600 442000 9_EnhA1    SRF 
chr10 441600 442000 9_EnhA1  MEF2A 

I was able to transform my dataset to this format where groups of chr, start, end and enhancer represent a single ID:

> dat

 id motif 
 1  GATA6 
 1  GATA4 
 1    SRF 
 1  MEF2A 
 2  GATA6 
 2  GATA4
 3  GATA6 
 3  GATA4 
 3    SRF 
 3  MEF2A 
 4    SRF 
 4  MEF2A 

I want to find the count of every possible pair of motifs, grouped by id. So I want an output table like,

motif1 motif2 count
 GATA6  GATA4     3
 GATA6    SRF     2
 GATA6  MEF2A     2
 ... and so on for each pair of motif

In the actual dataset, there are 1716 unique motifs. There are 83509 unique id.

Any suggestions on how to proceed?

like image 493
Komal Rathi Avatar asked Oct 07 '14 20:10

Komal Rathi


People also ask

Can you group by multiple columns in SQL?

We can group the resultset in SQL on multiple column values. When we define the grouping criteria on more than one column, all the records having the same value for the columns defined in the group by clause are collectively represented using a single record in the query output.

What provides the occurrences for the combination of values for two columns?

To groupby columns and count the occurrences of each combination in Pandas, we use the DataFrame. groupby() with size(). The groupby() method separates the DataFrame into groups.


2 Answers

Updated: Here is a fast and memory efficient version using data.table:

  • Step 1: Construct sample data of your dimensions approximately:

    require(data.table) ## 1.9.4+
    set.seed(1L)        ## For reproducibility
    N = 2724098L
    motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
    id = sample(83509, N, TRUE)
    DT = data.table(id, motif)
    
  • Step 2: Pre-processing:

    DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
    setorder(DT)    ## IMPORTANT: motifs are ordered within id as well
    setkey(DT, id)  ## reset key to 'id'. Motifs ordered within id from previous step
    DT[, runlen := .I]
    
  • Step 3: Solution:

    ans = DT[DT, {
                  tmp = runlen < i.runlen; 
                  list(motif[tmp], i.motif[any(tmp)])
                 }, 
          by=.EACHI][, .N, by="V1,V2"]
    

    This takes ~27 seconds and ~1GB of memory during the final step 3.

The idea is to perform a self-join, but make use of data.table's by=.EACHI feature, which evaluates the j-expression for each i, and therefore memory efficient. And the j-expression makes sure that we only obtain the entry "motif_a, motif_b" and not the redundant "motif_b,motif_a". This saves computation time and memory as well. And the binary search is quite fast, even though there are 87K+ ids. Finally we aggregate by the motif combinations to get the number of rows in each of them - which is what you require.

HTH

PS: See revision for the older (+ slower) version.

like image 112
Arun Avatar answered Oct 27 '22 20:10

Arun


Here is a sparse matrix technique shamelessly borrowed from this question.

# Create an id
dat$id <- as.factor(paste(dat$chr, dat$start, dat$end, dat$enhancer))

# Create the sparse matrix.
library(Matrix)
s <- sparseMatrix(
      as.numeric(dat$id), 
      as.numeric(dat$motif),
      dimnames = list(levels(dat$id),levels(dat$motif)),
  x = TRUE)

co.oc <- t(s) %*% s # Find co-occurrences.
tab <- summary(co.oc) # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix

data.frame(motif1 = levels(dat$motif)[tab$i],
           motif2 = levels(dat$motif)[tab$j],
           number = tab$x)

#    motif1 motif2 number
# 1  GATA4  GATA6      3
# 2  GATA4  MEF2A      2
# 3  GATA6  MEF2A      2
# 4  GATA4    SRF      2
# 5  GATA6    SRF      2
# 6  MEF2A    SRF      3
like image 33
nograpes Avatar answered Oct 27 '22 19:10

nograpes