
pandas DataFrame.groupby with a tolerance

Given the following sample data:

import pandas as pd

data = {'Object': ['objA', 'objB', 'objC', 'objD', 'objE'],
        'Length': [10.1, 10.02, 7.4, 6.24, 5.99]}

df = pd.DataFrame(data)
df

Which results in the following dataframe:

Out[6]:
   Length Object
0   10.10   objA
1   10.02   objB
2    7.40   objC
3    6.24   objD
4    5.99   objE

I'd like to group the rows by the 'Length' column within a +- tolerance, something like the pseudocode below:

tolerance = .25
grouped = df.groupby(df['Length'] +- tolerance)

Which would result in a grouping similar to the one below:

{(10.10+-.25): [0L, 1L],
 (7.40+-.25):  [2L],
 (6.24+-.25):  [3L, 4L]}

Looking around, folks have suggested using pd.cut with predefined bins. However, given the true size of my dataset and the variability of the lengths, pre-computing the bin ranges seems like a brute-force solution. Does anyone have a more elegant, faster, pandas/numpy-esque solution?

asked Nov 09 '22 19:11 by destructo


1 Answer

I'd suggest using the intervaltree package on PyPI, instead of a pandas/numpy-esque solution.

The idea is to add each length +/- tolerance interval to the interval tree, having the interval map to the associated object. Then, iterate over the lengths and query the interval tree. This will give you all of the objects that have a tolerance interval containing the queried length.

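from intervaltree import IntervalTree

# `data` is the dictionary from the question; `tolerance` is the question's .25
tolerance = .25

# Store one interval per object: [length - tolerance, length + tolerance)
t = IntervalTree()
for length, obj in zip(data['Length'], data['Object']):
    t[length-tolerance:length+tolerance] = obj

# For each length, collect the objects whose interval contains it
result = {}
for length in data['Length']:
    objs = [iv.data for iv in t[length]]
    result[length] = objs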

The result dictionary is as follows:

{10.1: ['objA', 'objB'], 5.99: ['objD', 'objE'], 10.02: ['objA', 'objB'], 6.24: ['objD'], 7.4: ['objC']}

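It's not quite in the format you specified, but it should be straightforward to reshape it as needed. Note that 6.24 maps only to objD even though 5.99 maps to both objD and objE: intervaltree intervals include the lower bound but exclude the upper bound, so objE's interval, which ends at 5.99 + .25, does not contain 6.24.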
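If you want the exact shape from the question (one anchor length mapped to the row indices near it), here is a minimal sketch. It does not reuse the interval tree. It swaps in a greedy sort-and-sweep over the plain lists in data, and it assumes that the largest length in each group acts as the anchor and that the tolerance is inclusive. The function name group_with_tolerance is hypothetical, not part of pandas or intervaltree.

```python
def group_with_tolerance(lengths, tolerance):
    """Greedy sketch: map each anchor length to the row indices within `tolerance` below it."""
    # Visit row indices from the longest length to the shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    groups = {}
    anchor = None
    for i in order:
        # Start a new group once a length falls more than `tolerance` below the anchor.
        if anchor is None or anchor - lengths[i] > tolerance:
            anchor = lengths[i]
            groups[anchor] = []
        groups[anchor].append(i)
    return groups


data = {'Object': ['objA', 'objB', 'objC', 'objD', 'objE'],
        'Length': [10.1, 10.02, 7.4, 6.24, 5.99]}

grouped = group_with_tolerance(data['Length'], tolerance=.25)
print(grouped)  # {10.1: [0, 1], 7.4: [2], 6.24: [3, 4]}
```

Because every group is measured against its first (largest) member, a long chain of closely spaced lengths is split once it drifts more than the tolerance below the anchor. Whether that is the behaviour you want depends on how you define "within tolerance" for your data.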

answered Nov 15 '22 11:11 by root