Given the following sample of my data:
import pandas as pd

data = {'Object': ['objA', 'objB', 'objC', 'objD', 'objE'],
        'Length': [10.1, 10.02, 7.4, 6.24, 5.99]}
df = pd.DataFrame(data)
df
Which results in the following dataframe:
Out[6]:
   Length Object
0   10.10   objA
1   10.02   objB
2    7.40   objC
3    6.24   objD
4    5.99   objE
I'd like to group the 'Length' column based on a +- tolerance. Something like the pseudocode below:
tolerance = .25
grouped = df.groupby(df['Length'] +- tolerance)
Which would result in a grouping similar to the one below:
{(10.10+-.25): [0L, 1L],
(7.40+-.25): [2L],
(6.24+-.25): [3L, 4L]}
Looking around, folks have suggested using pd.cut with predefined bins; however, given the true size of my dataset and the variability of the lengths, precomputing the bin ranges feels like a brute-force solution. Does anyone out there have a more elegant/fast/pandas/numpy-esque solution?
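For reference, the pd.cut approach I've seen suggested looks something like the sketch below (the 0.5 bin width and `include_lowest=True` are my own guesses at reasonable settings). Whether two lengths land in the same group depends entirely on where the precomputed edges happen to fall, which is why it feels brute-force:

```python
import numpy as np
import pandas as pd

data = {'Object': ['objA', 'objB', 'objC', 'objD', 'objE'],
        'Length': [10.1, 10.02, 7.4, 6.24, 5.99]}
df = pd.DataFrame(data)

# precompute fixed-width bins spanning the observed lengths
bins = np.arange(df['Length'].min(), df['Length'].max() + 0.5, 0.5)
cut = pd.cut(df['Length'], bins, include_lowest=True)

# group row indices by the bin each length falls into
groups = df.groupby(cut, observed=True).groups
```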
I'd suggest using the intervaltree package on PyPI, instead of a pandas/numpy-esque solution.
The idea is to add each length +/- tolerance interval to the interval tree, having the interval map to the associated object. Then, iterate over the lengths and query the interval tree. This will give you all of the objects that have a tolerance interval containing the queried length.
from intervaltree import IntervalTree

tolerance = 0.25
t = IntervalTree()

# map each length's +/- tolerance window to its object
for length, obj in zip(data['Length'], data['Object']):
    t[length - tolerance:length + tolerance] = obj

# query the tree at each length; collect every object whose window contains it
result = {}
for length in data['Length']:
    objs = [iv.data for iv in t[length]]
    result[length] = objs
The result dictionary is as follows:
{10.1: ['objA', 'objB'], 5.99: ['objD', 'objE'], 10.02: ['objA', 'objB'], 6.24: ['objD'], 7.4: ['objC']}
It's not quite in the format you specified, but it should be straightforward enough to make any changes to the format that you need.
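For example, one way to get closer to the interval-keyed, index-valued format from the question is a small post-processing step over result (the name obj_to_idx is mine). Note that overlapping windows still yield one entry per queried length — both 10.1 and 10.02 map to rows [0, 1] — so merging duplicate groups is still up to you:

```python
# `data`, `tolerance`, and `result` are the names used above
data = {'Object': ['objA', 'objB', 'objC', 'objD', 'objE'],
        'Length': [10.1, 10.02, 7.4, 6.24, 5.99]}
tolerance = 0.25
result = {10.1: ['objA', 'objB'], 5.99: ['objD', 'objE'],
          10.02: ['objA', 'objB'], 6.24: ['objD'], 7.4: ['objC']}

# row index of each object (the dataframe's index is just 0..n-1 here)
obj_to_idx = {obj: idx for idx, obj in enumerate(data['Object'])}

# key each group by its (length - tolerance, length + tolerance) interval
grouped = {(round(length - tolerance, 2), round(length + tolerance, 2)):
           sorted(obj_to_idx[o] for o in objs)
           for length, objs in result.items()}
```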