Lets say you have a DataFrame of regions (start, end) coordinates and another DataFrame of positions which may or may not fall within a given region. For example:
region = pd.DataFrame({'chromosome': [1, 1, 1, 1, 2, 2, 2, 2], 'start': [1000, 2000, 3000, 4000, 1000, 2000, 3000, 4000], 'end': [2000, 3000, 4000, 5000, 2000, 3000, 4000, 5000]})
position = pd.DataFrame({'chromosome': [1, 2, 1, 3, 2, 1, 1], 'BP': [1500, 1100, 10000, 2200, 3300, 400, 5000]})
print region
print position
   chromosome   end  start
0           1  2000   1000
1           1  3000   2000
2           1  4000   3000
3           1  5000   4000
4           2  2000   1000
5           2  3000   2000
6           2  4000   3000
7           2  5000   4000
      BP  chromosome
0   1500           1
1   1100           2
2  10000           1
3   2200           3
4   3300           2
5    400           1
6   5000           1
A position falls within a region if:
position['BP'] >= region['start'] &
position['BP'] <= region['end'] &
position['chromosome'] == region['chromosome']
Each position is guaranteed to fall within a maximum of one region although it might not fall in any.
What is the best way to merge these two dataframe such that it appends additional columns to position with the region it falls in if it falls in any region. Giving in this case roughly the following output:
      BP  chromosome  start  end
0   1500           1  1000   2000
1   1100           2  1000   2000
2  10000           1  NA     NA
3   2200           3  NA     NA
4   3300           2  3000   4000
5    400           1  NA     NA
6   5000           1  4000   5000
One approach is to write a function to compute the relationship I want and then to use the DataFrame.apply method as follows:
def within(pos, regs):
    istrue = (pos.loc['chromosome'] == regs['chromosome']) & (pos.loc['BP'] >= regs['start']) &  (pos.loc['BP'] <= regs['end'])
    if istrue.any():
        ind = regs.index[istrue].values[0]
        return(regs.loc[ind ,['start', 'end']])
    else:
        return(pd.Series([None, None], index=['start', 'end']))
position[['start', 'end']] = position.apply(lambda x: within(x, region), axis=1)
print position
      BP  chromosome  start   end
0   1500           1   1000  2000
1   1100           2   1000  2000
2  10000           1    NaN   NaN
3   2200           3    NaN   NaN
4   3300           2   3000  4000
5    400           1    NaN   NaN
6   5000           1   4000  5000
But I'm hoping that there is a more optimized way than doing each comparison in O(N) time. Thanks!
One solution would be to do an inner-join on chromosome, exclude the violating rows, and then do left-join with position:
>>> df = pd.merge(position, region, on='chromosome', how='inner')
>>> idx = (df['BP'] < df['start']) | (df['end'] < df['BP'])  # violating rows
>>> pd.merge(position, df[~idx], on=['BP', 'chromosome'], how='left')
      BP  chromosome   end  start
0   1500           1  2000   1000
1   1100           2  2000   1000
2  10000           1   NaN    NaN
3   2200           3   NaN    NaN
4   3300           2  4000   3000
5    400           1   NaN    NaN
6   5000           1  5000   4000
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With