I have a pandas dataframe that looks like the following:

  chrom  start  end  probability   read
0  chr1      1   10         0.99  read1
1  chr1      5   20         0.99  read2
2  chr1     15   25         0.99  read2
3  chr1     30   40         0.75  read4
What I want to do is merge the intervals that are on the same chromosome (chrom column) and whose coordinates (start, end) overlap. In some situations, where multiple intervals overlap each other, intervals that do not overlap directly should still be merged; see row 0 and row 2 in the example above and the output of the merge below.
For those elements that are merged, I want to sum their probabilities (probability column) and count the unique elements in the 'read' column.
This would lead to the following output on the example above; note that rows 0, 1 and 2 have been merged:

  chrom  start  end  probability  read
0  chr1      1   25         2.97     2
1  chr1     30   40         0.75     1
Up to now, I have been doing this with pybedtools merge, but it has turned out to be too slow to run millions of times (my case). Hence, I am looking for other options, and pandas is the obvious one. I know that with pandas groupby one can apply different operations to the columns that are going to be merged, such as nunique and sum, which are the ones I need. Nevertheless, pandas groupby only merges rows with exactly the same 'chrom', 'start' and 'end' coordinates.
My problem is that I don't know how to use pandas to merge rows based on their coordinates (chrom, start, end) and then apply the sum and nunique operations.
Is there a fast way of doing this?
thanks!
PS: As I said in my question, I am doing this millions of times, so speed is a big issue. Hence, I am not able to use pybedtools or pure Python, which are too slow for my goal.
Thanks!
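For reference, the pybedtools route I am trying to replace looks roughly like this (a sketch, not my exact code; it assumes bedtools merge's -c/-o options, with probability as column 4 and read as column 5):

import pandas as pd
import pybedtools

df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr1', 'chr1'],
                   'start': [1, 5, 15, 30],
                   'end': [10, 20, 25, 40],
                   'probability': [0.99, 0.99, 0.99, 0.75],
                   'read': ['read1', 'read2', 'read2', 'read4']})

# sort, then merge overlapping intervals, summing probability (column 4)
# and counting the distinct values of read (column 5)
bt = pybedtools.BedTool.from_dataframe(df).sort()
merged = bt.merge(c='4,5', o='sum,count_distinct')
print(merged)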
As noted by @root, the accepted answer fails to generalize to similar cases. For example, if we add an extra row with range 2-3 to the example in the question:
df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
                   'start': [1, 2, 5, 15, 30],
                   'end': [10, 3, 20, 25, 40],
                   'probability': [0.99, 0.99, 0.99, 0.99, 0.75],
                   'read': ['read1', 'read2', 'read2', 'read2', 'read4']})
...the suggested aggregate function outputs the following dataframe:

  chrom  start  end  probability  read
0  chr1      1    3         1.98     2
1  chr1      5   25         1.98     1
2  chr1     30   40         0.75     1

Note that the interval 5-20 starts inside the range 1-10, yet it is no longer captured in the first group. The ranges 1-10, 2-3, 5-20 and 15-25 form one overlapping chain and so should all be grouped together.
One solution is the following approach (using the aggregate function suggested by @W-B and the method for combining intervals posted by @CentAu).
# Union intervals, by @CentAu
from sympy import Interval, Union

def union(data):
    """Union of a list of intervals, e.g. [(1,2),(3,4)]."""
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)
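For example, on the four intervals from the question this helper collapses the three overlapping ones into a single interval (the exact repr may vary with the sympy version):

print(union([(1, 10), (5, 20), (15, 25), (30, 40)]))
# [Interval(1, 25), Interval(30, 40)]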
# Get intervals for rows
def f(x, position=None):
    """
    Returns an interval for each row. The start and end positions are the
    minimum and maximum positions of all overlapping ranges within the group.

    Args:
        position (str, optional): if 'start' or 'end', return that bound of
            the containing interval instead of the interval itself.
    """
    intervals = union(x)
    if position and position.lower() == 'start':
        group = x.str[0].apply(lambda y: [i.start for i in intervals if i.contains(y)][0])
    elif position and position.lower() == 'end':
        group = x.str[0].apply(lambda y: [i.end for i in intervals if i.contains(y)][0])
    else:
        group = x.str[0].apply(lambda y: [i for i in intervals if i.contains(y)][0])
    return group
# Combine start and end into a single column
df['start_end'] = df[['start', 'end']].apply(list, axis=1)

# Assign each row to an interval and add start/end columns
df['start_interval'] = (df[['chrom', 'start_end']]
                        .groupby(['chrom']).transform(f, 'start'))
df['end_interval'] = (df[['chrom', 'start_end']]
                      .groupby(['chrom']).transform(f, 'end'))

# Aggregate rows, using the approach by @W-B
df.groupby(['chrom', 'start_interval', 'end_interval']).agg(
    {'probability': 'sum', 'read': 'nunique'}).reset_index()
...which outputs the following dataframe:

  chrom  start_interval  end_interval  probability  read
0  chr1               1            25         3.96     2
1  chr1              30            40         0.75     1

The summed probability for the first row is 3.96 because we are combining four rows instead of three.
While this approach should be more generalisable, it is not necessarily fast! Hopefully others can suggest speedier alternatives.
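One vectorised possibility (a sketch, not benchmarked; it uses pandas named aggregation, available from pandas 0.25): sort by chrom and start, and open a new group whenever an interval begins past the running maximum end seen so far. The cummax handles contained intervals that a plain shift of end misses.

import pandas as pd

df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
                   'start': [1, 2, 5, 15, 30],
                   'end': [10, 3, 20, 25, 40],
                   'probability': [0.99, 0.99, 0.99, 0.99, 0.75],
                   'read': ['read1', 'read2', 'read2', 'read2', 'read4']})

df = df.sort_values(['chrom', 'start']).reset_index(drop=True)

# a row opens a new group when it starts after every end seen so far
# on the same chromosome, or when the chromosome changes
cummax_end = df.groupby('chrom')['end'].cummax()
new_group = (df['start'] > cummax_end.shift()) | (df['chrom'] != df['chrom'].shift())
df['group'] = new_group.cumsum()

out = (df.groupby(['chrom', 'group'])
         .agg(start=('start', 'min'), end=('end', 'max'),
              probability=('probability', 'sum'), read=('read', 'nunique'))
         .reset_index()
         .drop(columns='group'))
print(out)
#   chrom  start  end  probability  read
# 0  chr1      1   25         3.96     2
# 1  chr1     30   40         0.75     1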
IIUC
df.groupby((df.end.shift() - df.start).lt(0).cumsum()).agg(
    {'chrom': 'first', 'start': 'first', 'end': 'last',
     'probability': 'sum', 'read': 'nunique'})
Out[417]:
  chrom  start  end  probability  read
0  chr1      1   25         2.97     2
1  chr1     30   40         0.75     1
More info on creating the group key:

(df.end.shift() - df.start).lt(0).cumsum()
Out[418]:
0    0
1    0
2    0
3    1
dtype: int32
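For contrast, reusing the five-row df from the earlier answer (with the extra 2-3 row), the same key splits the overlapping 5-20 interval into its own group, which is the failure @root pointed out:

(df.end.shift() - df.start).lt(0).cumsum()
# 0    0
# 1    0
# 2    1
# 3    1
# 4    2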
Here is an answer using pyranges and pandas. It is improved in that it does the merging really quickly, is easily parallelizable, and is very fast even in single-core mode.
Setup:
import pandas as pd
import numpy as np
import pyranges as pr

rows = int(1e7)
gr = pr.random(rows)                   # random intervals (length 100 by default)
gr.probability = np.random.rand(rows)  # attach a probability column
gr.read = np.arange(rows)              # attach a read id column
print(gr)
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# | Chromosome | Start | End | Strand | probability | read |
# | (category) | (int32) | (int32) | (category) | (float64) | (int64) |
# |--------------+-----------+-----------+--------------+----------------------+-----------|
# | chr1 | 149953099 | 149953199 | + | 0.7536048547309669 | 0 |
# | chr1 | 184344435 | 184344535 | + | 0.9358130407479777 | 1 |
# | chr1 | 238639916 | 238640016 | + | 0.024212603310159064 | 2 |
# | chr1 | 95180042 | 95180142 | + | 0.027139751993808026 | 3 |
# | ... | ... | ... | ... | ... | ... |
# | chrY | 34355323 | 34355423 | - | 0.8843190383030953 | 999996 |
# | chrY | 1818049 | 1818149 | - | 0.23138017743097572 | 999997 |
# | chrY | 10101456 | 10101556 | - | 0.3007915302642412 | 999998 |
# | chrY | 355910 | 356010 | - | 0.03694752911338561 | 999999 |
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# Stranded PyRanges object has 1,000,000 rows and 6 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Execution:
def praderas(df):
    # per-cluster sum of probability and count of reads
    grpby = df.groupby("Cluster")
    prob = grpby.probability.sum()
    prob.name = "ProbSum"
    n = grpby.read.count()  # the question asks for unique reads; grpby.read.nunique() would give that
    n.name = "Count"
    return df.merge(prob, on="Cluster").merge(n, on="Cluster")

%time result = gr.cluster().apply(praderas)
# 11.4s !
result[result.Count > 2]
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# | Chromosome | Start | End | Strand | probability | read | Cluster | ProbSum | Count |
# | (category) | (int32) | (int32) | (category) | (float64) | (int64) | (int32) | (float64) | (int64) |
# |--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------|
# | chr1 | 52952 | 53052 | + | 0.7411051557901921 | 59695 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 52959 | 53059 | + | 0.9979036360671423 | 356518 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 53029 | 53129 | + | 0.47409221639405397 | 104776 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 64657 | 64757 | + | 0.32465233067499366 | 386140 | 88 | 1.3880589602361695 | 3 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | chrY | 59356855 | 59356955 | - | 0.3877207561218887 | 9966373 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356865 | 59356965 | - | 0.4007557656399032 | 9907364 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356932 | 59357032 | - | 0.33799123310907786 | 9978653 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356980 | 59357080 | - | 0.055686136451676305 | 9994845 | 8502533 | 1.182153891322546 | 4 |
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# Stranded PyRanges object has 606,212 rows and 9 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
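As a side note (an assumption about the pyranges merge API; check your version): if only the merged coordinates are needed, PyRanges can also merge the clusters directly, though this drops the metadata columns, which is why the answer above keeps them with cluster() + apply:

merged = gr.merge()  # merged intervals only; probability/read are dropped
print(merged)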