Count every possible pair of values in a column grouped by multiple columns

Tags:

r

I have a data frame that looks like this (this is just a subset; the actual dataset has 2724098 rows):

> head(dat)

chr   start  end    enhancer motif 
chr10 238000 238600 9_EnhA1  GATA6 
chr10 238000 238600 9_EnhA1  GATA4 
chr10 238000 238600 9_EnhA1    SRF 
chr10 238000 238600 9_EnhA1  MEF2A 
chr10 375200 375400 9_EnhA1  GATA6 
chr10 375200 375400 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1  GATA6 
chr10 440400 441000 9_EnhA1  GATA4 
chr10 440400 441000 9_EnhA1    SRF 
chr10 440400 441000 9_EnhA1  MEF2A 
chr10 441600 442000 9_EnhA1    SRF 
chr10 441600 442000 9_EnhA1  MEF2A

I was able to transform my dataset to this format where groups of chr, start, end and enhancer represent a single ID:

> dat

 id motif 
 1  GATA6 
 1  GATA4 
 1    SRF 
 1  MEF2A 
 2  GATA6 
 2  GATA4
 3  GATA6 
 3  GATA4 
 3    SRF 
 3  MEF2A 
 4    SRF 
 4  MEF2A

I want to count every possible pair of motifs that occur together within the same id. So I want an output table like this:

motif1 motif2 count
 GATA6  GATA4     3
 GATA6    SRF     2
 GATA6  MEF2A     2
 ... and so on for each pair of motif

In the actual dataset, there are 1716 unique motifs and 83509 unique ids.

Any suggestions on how to proceed?

493

asked Oct 07 '14 20:10

Komal Rathi

2 Answers

Updated: Here is a fast and memory efficient version using data.table:

Step 1: Construct sample data of approximately your dimensions:

require(data.table) ## 1.9.4+
set.seed(1L)        ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)

Step 2: Pre-processing:

DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT)    ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id)  ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]

Step 3: Solution:

ans = DT[DT, {
              tmp = runlen < i.runlen; 
              list(motif[tmp], i.motif[any(tmp)])
             }, 
      by=.EACHI][, .N, by="V1,V2"]

This takes ~27 seconds and ~1GB of memory during the final step 3.

The idea is to perform a self-join using data.table's by=.EACHI feature, which evaluates the j-expression once for each row of i and is therefore memory efficient. The j-expression ensures that we obtain only the entry "motif_a, motif_b" and not the redundant "motif_b, motif_a", which saves both computation time and memory. The binary search is quite fast, even with 83K+ ids. Finally, we aggregate by motif combination to count the rows in each, which is what you require.
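To sanity-check the result on a small subset, the same "count each unordered pair once per id" logic can be sketched in base R on the toy data from the question. This is not the data.table self-join above; it uses combn() per id and is only meant as a cross-check, since it enumerates pairs id by id and is unlikely to scale as well. The names dat, pairs_per_id and pair_counts are illustrative.

```r
# Toy data from the question (ids 1 to 4).
dat <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  motif = c("GATA6", "GATA4", "SRF", "MEF2A",
            "GATA6", "GATA4",
            "GATA6", "GATA4", "SRF", "MEF2A",
            "SRF", "MEF2A"),
  stringsAsFactors = FALSE
)

# For each id, list every unordered pair of distinct motifs once.
# Sorting first means (a, b) and (b, a) always appear as the same row.
pairs_per_id <- lapply(split(dat$motif, dat$id), function(m) {
  m <- sort(unique(m))
  if (length(m) < 2) return(NULL)   # an id with one motif has no pairs
  t(combn(m, 2))                    # one row per pair
})

# Stack the per-id pairs and count how many ids each pair occurs in.
pairs <- do.call(rbind, pairs_per_id)
pair_counts <- as.data.frame(
  table(motif1 = pairs[, 1], motif2 = pairs[, 2]),
  responseName = "count",
  stringsAsFactors = FALSE
)
pair_counts <- pair_counts[pair_counts$count > 0, ]
```

On the toy data this yields six pairs, with GATA4/GATA6 and MEF2A/SRF each occurring in three ids and the remaining pairs in two.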

HTH

PS: See revision for the older (+ slower) version.

112

answered Oct 27 '22 20:10

Arun

Here is a sparse matrix technique shamelessly borrowed from this question.

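# Create an id
dat$id <- as.factor(paste(dat$chr, dat$start, dat$end, dat$enhancer))
# Ensure motif is a factor: as.numeric() and levels() below rely on it,
# and since R 4.0.0 data.frame() no longer converts strings to factors by default.
dat$motif <- as.factor(dat$motif)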

# Create the sparse matrix.
library(Matrix)
s <- sparseMatrix(
      as.numeric(dat$id),
      as.numeric(dat$motif),
      dimnames = list(levels(dat$id), levels(dat$motif)),
      x = TRUE)

co.oc <- t(s) %*% s # Find co-occurrences.
tab <- summary(co.oc) # Create triplet representation.
tab <- tab[tab$i < tab$j,] # Extract upper triangle of matrix

data.frame(motif1 = levels(dat$motif)[tab$i],
           motif2 = levels(dat$motif)[tab$j],
           number = tab$x)

#    motif1 motif2 number
# 1  GATA4  GATA6      3
# 2  GATA4  MEF2A      2
# 3  GATA6  MEF2A      2
# 4  GATA4    SRF      2
# 5  GATA6    SRF      2
# 6  MEF2A    SRF      3
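Why the matrix product gives the counts: if s has one row per id and one column per motif, with a 1 wherever the motif occurs in that id, then entry (a, b) of t(s) %*% s sums s[id, a] * s[id, b] over all ids, which is the number of ids containing both motifs. Below is a minimal sketch on the toy data, assuming the Matrix package is installed. It uses crossprod(s), which computes the same product as t(s) %*% s, and x = 1 instead of x = TRUE to keep the matrix numeric. The names id_f, motif_f and co_oc are illustrative.

```r
library(Matrix)

# Toy data from the question (ids 1 to 4).
dat <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  motif = c("GATA6", "GATA4", "SRF", "MEF2A",
            "GATA6", "GATA4",
            "GATA6", "GATA4", "SRF", "MEF2A",
            "SRF", "MEF2A"),
  stringsAsFactors = FALSE
)

# Drop repeated (id, motif) rows: sparseMatrix() adds up duplicate entries,
# which would inflate the counts.
dat <- unique(dat)

id_f    <- factor(dat$id)
motif_f <- factor(dat$motif)

# Incidence matrix: rows are ids, columns are motifs.
s <- sparseMatrix(
  i = as.integer(id_f),
  j = as.integer(motif_f),
  x = 1,
  dims = c(nlevels(id_f), nlevels(motif_f)),
  dimnames = list(levels(id_f), levels(motif_f))
)

# Motif-by-motif co-occurrence counts. Converted to a dense matrix only
# because the toy example is tiny; keep it sparse for real data.
co_oc <- as.matrix(crossprod(s))
```

The diagonal of co_oc holds the number of ids each motif appears in, which is why the answer keeps only the entries with i < j.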

answered Oct 27 '22 19:10

nograpes
