I have a data frame that looks like this (this is just a subset; the actual dataset has 2724098 rows):
> head(dat)
chr start end enhancer motif
chr10 238000 238600 9_EnhA1 GATA6
chr10 238000 238600 9_EnhA1 GATA4
chr10 238000 238600 9_EnhA1 SRF
chr10 238000 238600 9_EnhA1 MEF2A
chr10 375200 375400 9_EnhA1 GATA6
chr10 375200 375400 9_EnhA1 GATA4
chr10 440400 441000 9_EnhA1 GATA6
chr10 440400 441000 9_EnhA1 GATA4
chr10 440400 441000 9_EnhA1 SRF
chr10 440400 441000 9_EnhA1 MEF2A
chr10 441600 442000 9_EnhA1 SRF
chr10 441600 442000 9_EnhA1 MEF2A
I was able to transform my dataset to this format where groups of chr, start, end and enhancer represent a single ID:
> dat
id motif
1 GATA6
1 GATA4
1 SRF
1 MEF2A
2 GATA6
2 GATA4
3 GATA6
3 GATA4
3 SRF
3 MEF2A
4 SRF
4 MEF2A
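For reference, one way such an id could be built from the four grouping columns is sketched below in base R. This is only an assumption about how the transformation might be done, not necessarily what was actually used; the `key` variable is a made-up helper, and the data are just the rows shown above.

```r
# Toy copy of the rows shown above (chr and enhancer are recycled).
dat <- data.frame(
  chr      = "chr10",
  start    = as.integer(c(238000, 238000, 238000, 238000, 375200, 375200,
                          440400, 440400, 440400, 440400, 441600, 441600)),
  end      = as.integer(c(238600, 238600, 238600, 238600, 375400, 375400,
                          441000, 441000, 441000, 441000, 442000, 442000)),
  enhancer = "9_EnhA1",
  motif    = c("GATA6", "GATA4", "SRF", "MEF2A", "GATA6", "GATA4",
               "GATA6", "GATA4", "SRF", "MEF2A", "SRF", "MEF2A"),
  stringsAsFactors = FALSE
)

# One string per (chr, start, end, enhancer) combination (hypothetical helper).
key <- paste(dat$chr, dat$start, dat$end, dat$enhancer)

# Number the combinations in order of first appearance.
dat$id <- match(key, unique(key))

print(dat[, c("id", "motif")])
```

Using `match()` against `unique()` keeps the ids in order of first appearance, which is what the table above shows; a factor built from `key` would number them alphabetically instead.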
I want to count, for every possible pair of motifs, the number of ids in which both motifs occur. So I want an output table like:
motif1 motif2 count
GATA6 GATA4 3
GATA6 SRF 2
GATA6 MEF2A 2
... and so on for each pair of motifs.
In the actual dataset, there are 1716 unique motifs and 83509 unique ids.
Any suggestions on how to proceed?
Updated: Here is a fast and memory efficient version using data.table:
Step 1: Construct sample data of your dimensions approximately:
require(data.table) ## 1.9.4+
set.seed(1L) ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
id = sample(83509, N, TRUE)
DT = data.table(id, motif)
Step 2: Pre-processing:
DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT) ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id) ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I]
Step 3: Solution:
ans = DT[DT, {
        tmp = runlen < i.runlen;
        list(motif[tmp], i.motif[any(tmp)])
    },
    by=.EACHI][, .N, by="V1,V2"]
This takes ~27 seconds and ~1GB of memory during the final step 3.
The idea is to perform a self-join, but to make use of data.table's by=.EACHI feature, which evaluates the j-expression for each row of i and is therefore memory efficient. The j-expression makes sure that we only obtain the entry "motif_a, motif_b" and not the redundant "motif_b, motif_a", which saves computation time and memory as well. The binary search is quite fast, even with 83K+ ids. Finally, we aggregate by the motif combinations to get the number of rows in each of them, which is what you require.
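To make the self-join idea concrete, here is a minimal base R sketch on the small example from the question. It is not the data.table solution above: it uses `merge()` and `aggregate()` purely for illustration, and it materializes the full join in memory, which is exactly what by=.EACHI avoids on the real data.

```r
# Toy data from the question (id, motif), already free of duplicates.
dat <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  motif = c("GATA6", "GATA4", "SRF", "MEF2A", "GATA6", "GATA4",
            "GATA6", "GATA4", "SRF", "MEF2A", "SRF", "MEF2A"),
  stringsAsFactors = FALSE
)

# Self-join on id: every motif paired with every motif of the same id.
pairs <- merge(dat, dat, by = "id", suffixes = c("1", "2"))

# Keep each unordered pair once (drops self-pairs and mirrored pairs).
pairs <- pairs[pairs$motif1 < pairs$motif2, ]

# Count the number of ids per motif pair.
counts <- aggregate(id ~ motif1 + motif2, data = pairs, FUN = length)
names(counts)[3] <- "count"
print(counts)
```

The `motif1 < motif2` filter plays the same role as `runlen < i.runlen` in the j-expression: it keeps one ordering of each pair and discards the other.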
HTH
PS: See revision for the older (+ slower) version.
Here is a sparse matrix technique shamelessly borrowed from this question.
# Create an id
dat$id <- as.factor(paste(dat$chr, dat$start, dat$end, dat$enhancer))
# as.numeric() and levels() below need motif to be a factor
# (columns are character by default in R >= 4.0).
dat$motif <- as.factor(dat$motif)
# Create the sparse matrix (rows = ids, columns = motifs).
library(Matrix)
s <- sparseMatrix(
       as.numeric(dat$id),
       as.numeric(dat$motif),
       dimnames = list(levels(dat$id), levels(dat$motif)),
       x = TRUE)
co.oc <- t(s) %*% s           # Find co-occurrences.
tab <- summary(co.oc)         # Create triplet representation.
tab <- tab[tab$i < tab$j,]    # Extract upper triangle of matrix
data.frame(motif1 = levels(dat$motif)[tab$i],
           motif2 = levels(dat$motif)[tab$j],
           number = tab$x)
# motif1 motif2 number
# 1 GATA4 GATA6 3
# 2 GATA4 MEF2A 2
# 3 GATA6 MEF2A 2
# 4 GATA4 SRF 2
# 5 GATA6 SRF 2
# 6 MEF2A SRF 3
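To see why `t(s) %*% s` gives the co-occurrence counts, here is a small dense sketch in base R on the (id, motif) example from the question. It swaps the sparse matrix for `table()` and `crossprod()`, so it is only meant for toy data; at 83509 x 1716 the sparse version above is the one to use.

```r
# Toy data from the question (id, motif), with no duplicate rows.
dat <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  motif = c("GATA6", "GATA4", "SRF", "MEF2A", "GATA6", "GATA4",
            "GATA6", "GATA4", "SRF", "MEF2A", "SRF", "MEF2A"),
  stringsAsFactors = FALSE
)

# Incidence matrix: one row per id, one column per motif, entries 0/1.
inc <- table(dat$id, dat$motif)

# crossprod(inc) is t(inc) %*% inc: entry [a, b] sums inc[id, a] * inc[id, b]
# over ids, i.e. the number of ids that contain both motif a and motif b.
co <- crossprod(inc)
print(co)

co["GATA4", "GATA6"]  # ids containing both GATA4 and GATA6
```

The diagonal holds the number of ids per motif, and the matrix is symmetric, which is why only the upper triangle (`i < j`) is kept in the answer above.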