Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas merge intervals by range

I have a pandas dataframe that looks as the following one:

  chrom  start  end  probability   read
0  chr1      1   10         0.99  read1
1  chr1      5   25         0.99  read2
2  chr1     15   25         0.99  read2
3  chr1     30   40         0.75  read4

What I wanna do is to merge the intervals that have the same chromosome (chrom column), and whose coordinates (start,end) overlap. In some situations, were multiple intervals overlap each other, there will be intervals that should be merged, even though they do not overlap. See row 0 and row 2 in the above mentioned example and the output of the merging below

For those elements that are merged, I want to sum their probabilities (probability column) and count the unique elements in the 'read' column.

Which would lead to the following output using the example above, note that rows 0,1 and 2 have been merged:

 chrom  start  end  probability  read
0  chr1      1   20         2.97     2
1  chr1     30   40         0.75     1

Up to now, I have been doing this with pybedtools merge, but it has turn out that it is slow for doing it millions of times (my case). Hence, I am looking for other options and pandas is the obvious one. I know that with pandas groupby one can apply different operations to the columns that are going to be merged like nunique and sum, which are the ones that I will need to apply. Nevertheless, pandas groupby only merges data with exact 'chrom', 'start' and 'end' coordinates.

My problem is that I don't know how to use pandas to merge my rows based on the coordinates (chrom,start,end) and then apply the sum and nunique operations.

Is there a fast way of doing this?

thanks!

PS: As I have told on my question, I am doing this millions of times, so speed is a big issue. Hence, I am not able to use pybedtools or pure python, which are too slow for my goal.

Thanks!

like image 856
Praderas Avatar asked Feb 12 '18 18:02

Praderas


3 Answers

As suggested by @root, the accepted answer fails to generalize to similar cases. e.g. if we add an extra row with range 2-3 to the example in the question:

df = pd.DataFrame({'chrom': ['chr1','chr1','chr1','chr1','chr1'], 
    'start': [1, 2, 5, 15, 30],
    'end': [10, 3, 20, 25, 40],
    'probability': [0.99, 0.99, 0.99, 0.99, 0.75],
    'read': ['read1','read2','read2','read2','read4']})

...the suggested aggregate function outputs the following dataframe. Note that 4 is in the range 1-10, but it is no longer captured. The ranges 1-10, 2-3, 5-20 and 15-25 all overlap and so should be grouped together.

enter image description here

One solution is the following approach (using the aggregate function suggested by @W-B and the method for combining intervals posted by @CentAu).

# Union intervals by @CentAu
from sympy import Interval, Union
def union(data):
    """ Union of a list of intervals e.g. [(1,2),(3,4)] """
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) \
       else list(u.args)

# Get intervals for rows
def f(x,position=None):
    """
    Returns an interval for the row. The start and stop position indicate the minimum
        and maximum position of all overlapping ranges within the group.
    Args: 
        position (str, optional): Returns an integer indicating start or stop position.
    """
    intervals = union(x)
    if position and position.lower() == 'start':
        group = x.str[0].apply(lambda y: [l.start for g,l in enumerate(intervals) if l.contains(y)][0])
    elif position and position.lower() == 'end':
        group = x.str[0].apply(lambda y: [l.end for g,l in enumerate(intervals) if l.contains(y)][0])
    else:
        group = x.str[0].apply(lambda y: [l for g,l in enumerate(intervals) if l.contains(y)][0])
    return group

# Combine start and end into a single column
df['start_end'] = df[['start', 'end']].apply(list, axis=1)

# Assign each row to an interval and add start/end columns
df['start_interval'] = df[['chrom',
    'start_end']].groupby(['chrom']).transform(f,'start')
df['end_interval'] = df[['chrom',
    'start_end']].groupby(['chrom']).transform(f,'end')

# Aggregate rows, using approach by @W-B
df.groupby(['chrom','start_interval','end_interval']).agg({'probability':'sum',
'read':'nunique'}).reset_index()

...which outputs the following dataframe. Summed probability for the first row is 3.96 because we are combining four rows instead of three.

enter image description here

While this approach should be more generalisable, it is not necessarily fast! Hopefully others can suggest speedier alternatives.

like image 107
tomp Avatar answered Oct 13 '22 20:10

tomp


IIUC

df.groupby((df.end.shift()-df.start).lt(0).cumsum()).agg({'chrom':'first','start':'first','end':'last','probability':'sum','read':'nunique'})
Out[417]: 
  chrom  start  end  probability  read
0  chr1      1   20         2.97     2
1  chr1     30   40         0.75     1

More info create the group key

(df.end.shift()-df.start).lt(0).cumsum()
Out[418]: 
0    0
1    0
2    0
3    1
dtype: int32
like image 26
BENY Avatar answered Oct 13 '22 19:10

BENY


Here is an answer using pyranges and pandas. It is improved in that it does the merging really quickly, is easily parallelizeable and super duper fast even in single-core mode.

Setup:

import pandas as pd
import pyranges as pr
import numpy as np

rows = int(1e7)
gr = pr.random(rows)
gr.probability = np.random.rand(rows)
gr.read = np.arange(rows)
print(gr)

# +--------------+-----------+-----------+--------------+----------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------|
# | chr1         | 149953099 | 149953199 | +            | 0.7536048547309669   | 0         |
# | chr1         | 184344435 | 184344535 | +            | 0.9358130407479777   | 1         |
# | chr1         | 238639916 | 238640016 | +            | 0.024212603310159064 | 2         |
# | chr1         | 95180042  | 95180142  | +            | 0.027139751993808026 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       |
# | chrY         | 34355323  | 34355423  | -            | 0.8843190383030953   | 999996    |
# | chrY         | 1818049   | 1818149   | -            | 0.23138017743097572  | 999997    |
# | chrY         | 10101456  | 10101556  | -            | 0.3007915302642412   | 999998    |
# | chrY         | 355910    | 356010    | -            | 0.03694752911338561  | 999999    |
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# Stranded PyRanges object has 1,000,000 rows and 6 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.

Execution:

def praderas(df):
    grpby = df.groupby("Cluster")
    prob = grpby.probability.sum()
    prob.name = "ProbSum"
    n = grpby.read.count()
    n.name = "Count"

    return df.merge(prob, on="Cluster").merge(n, on="Cluster")

%time result = gr.cluster().apply(praderas)
# 11.4s !
result[result.Count > 2]
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      | Cluster   | ProbSum            | Count     |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   | (int32)   | (float64)          | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------|
# | chr1         | 52952     | 53052     | +            | 0.7411051557901921   | 59695     | 70        | 2.2131010082513884 | 3         |
# | chr1         | 52959     | 53059     | +            | 0.9979036360671423   | 356518    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 53029     | 53129     | +            | 0.47409221639405397  | 104776    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 64657     | 64757     | +            | 0.32465233067499366  | 386140    | 88        | 1.3880589602361695 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       | ...       | ...                | ...       |
# | chrY         | 59356855  | 59356955  | -            | 0.3877207561218887   | 9966373   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356865  | 59356965  | -            | 0.4007557656399032   | 9907364   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356932  | 59357032  | -            | 0.33799123310907786  | 9978653   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356980  | 59357080  | -            | 0.055686136451676305 | 9994845   | 8502533   | 1.182153891322546  | 4         |
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# Stranded PyRanges object has 606,212 rows and 9 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
like image 2
The Unfun Cat Avatar answered Oct 13 '22 19:10

The Unfun Cat