I've googled, I've tested, and this has me at my wits end. I have a list of numbers I need to group by similarity. For instance, in a list of [1, 6, 9, 100, 102, 105, 109, 134, 139], 1 6 9 would be put into a list, 100, 102, 105, and 109 would be put into a list, and 134 and 139. I'm terrible at math, and I've tried and tried this, but I can't get it to work. To be explicit as possible, I wish to group numbers that are within 10 values away from one another. Can anyone help? Thanks.
There are many ways to do cluster analysis. One simple approach is to look at the gap size between successive data elements:
def cluster(data, maxgap):
'''Arrange data into groups where successive elements
differ by no more than *maxgap*
>>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
[[1, 6, 9], [100, 102, 105, 109], [134, 139]]
>>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
[[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]
'''
data.sort()
groups = [[data[0]]]
for x in data[1:]:
if abs(x - groups[-1][-1]) <= maxgap:
groups[-1].append(x)
else:
groups.append([x])
return groups
if __name__ == '__main__':
import doctest
print(doctest.testmod())
This will find the groups:
nums = [1, 6, 9, 100, 102, 105, 109, 134, 139]
for k, g in itertools.groupby(nums, key=lambda n: n//10):
print k, list(g)
0 [1, 6, 9]
10 [100, 102, 105, 109]
13 [134, 139]
Note that if nums isn't actually sorted as your sample shows, you'll need to sort it first.
First, you can easily convert any sequence into a sequence of pairs of adjacent items. Just tee it, shift it forward, and zip the unshifted and unshifted copies. The only trick is that you need to start with either (<something>, 1)
or (139, <something>)
, because in this case we want not each pair of elements, but a pair for each element:
def pairify(it):
it0, it1 = itertools.tee(it, 2)
first = next(it0)
return zip(itertools.chain([first, first], it0), it1)
(This isn't the simplest way to write it, but I think this may be the way that's most readable to people who aren't familiar with itertools
.)
>>> a = [1, 6, 9, 100, 102, 105, 109, 134, 139]
>>> list(pairify(a))
[(1, 1), (1, 6), (6, 9), (9, 100), (100, 102), (102, 105), (105, 109), (109, 134), (134, 139)]
Then, with a slightly more complicated version of Ned Batchelder's key, you can just use groupby
.
However, I think in this case this will end up being more complicated than an explicit generator that does the same thing.
def cluster(sequence, maxgap):
batch = []
for prev, val in pairify(sequence):
if val - prev >= maxgap:
yield batch
batch = []
else:
batch.append(val)
if batch:
yield batch
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With