
Identifying groups of similar numbers in a list

Tags: python, list

I have lists of numbers that I'd like to group by similarity. The order of the numbers in the list is fixed and important to preserve.

As an example, here's a visualisation of what I'm trying to achieve:

The black line represents the list of numbers I have. The green lines represent the groupings I would like to identify in this example list.

The order of numbers in the list is important and cannot be changed (e.g. I cannot sort ascending or descending). The similar numbers are close but not identical (i.e. a group is unlikely to be 6, 6, 6, 6, and more likely something like 5.85, 6.1, 5.96, 5.88).

Is there a method to do this?

Edit: example values, and desired groupings:

[4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

would result in an approximate grouping of

[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]

In the grouping above, you could argue that 4.52 could belong to the first or second group. If visualised as I did in the example above, the groupings would be represented by the green lines. My lists are actually several hundred to several thousand values in length.

J.P. asked Dec 06 '25 00:12


2 Answers

You can use itertools.groupby: it groups consecutive elements for which the key function returns the same value. Here the key is round, so consecutive values that round to the same integer end up in one group:

In [7]: import itertools

In [8]: data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

In [9]: [tuple(xs) for _, xs in itertools.groupby(data, round)]
Out[9]: 
[(4.1, 4.05, 4.14, 4.01, 3.97),
 (4.52, 4.97, 5.02, 5.05, 5.2, 5.18),
 (3.66, 3.77, 3.59, 3.72)]
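
One caveat with round as the key: it bins values at fixed boundaries (around x.5), so two close values on either side of a boundary land in different groups. A minimal sketch illustrating this follows. The helper group_consecutive_by_bin and its width parameter are hypothetical names introduced here for illustration, not part of itertools:

```python
import itertools

def group_consecutive_by_bin(data, width=1.0):
    """Group consecutive items that round to the same multiple of `width`.

    Hypothetical helper for illustration; width=1.0 behaves like key=round.
    """
    def bin_index(x):
        return round(x / width)
    return [tuple(xs) for _, xs in itertools.groupby(data, key=bin_index)]

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18,
        3.66, 3.77, 3.59, 3.72]

# Same grouping as the session above.
groups = group_consecutive_by_bin(data)
print(groups)

# Close values that straddle 4.5 are split: 4.45 -> 4, 4.55 -> 5, 4.48 -> 4.
straddling = group_consecutive_by_bin([4.45, 4.55, 4.48])
print(straddling)
```

If the data tends to hover near a bin boundary, a different key function or a tolerance-based approach may suit it better.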

awesoon answered Dec 07 '25 15:12


from statistics import mean

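def ordered_cluster(data, max_diff):
    """Yield consecutive groups whose items all lie within max_diff of the group mean."""
    current_group = ()
    for item in data:
        # Tentatively add the item and check the whole group against its new mean.
        test_group = current_group + (item, )
        test_group_mean = mean(test_group)
        if all(abs(test_group_mean - test_item) < max_diff for test_item in test_group):
            current_group = test_group
        else:
            # The item does not fit: emit the finished group and start a new one with it.
            yield current_group
            current_group = (item, )
    # Emit the last group, if any.
    if current_group:
        yield current_group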

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

print(list(ordered_cluster(data, 0.5)))

Output:

[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]

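This ensures that every item in a group stays within max_diff of the group's mean. If adding the next item would break that condition for any member of the group, the current group is closed and a new group is started with that item.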
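
Since the lists can run to several thousand values, it may help that the check does not need to revisit every member: all items are within max_diff of the mean exactly when the smallest and largest items are. The following is a sketch of that variant, assuming max_diff is positive. The name ordered_cluster_running is hypothetical, and it keeps a running float sum instead of calling statistics.mean, so results could differ in extreme floating-point edge cases:

```python
def ordered_cluster_running(data, max_diff):
    """Same grouping rule, tracking a running sum, min and max per group."""
    group = []          # items of the current group, in original order
    total = 0.0         # running sum of the current group
    lo = hi = None      # running minimum and maximum of the current group
    for item in data:
        if group:
            new_lo, new_hi = min(lo, item), max(hi, item)
            new_mean = (total + item) / (len(group) + 1)
            # Only the extremes can be farthest from the mean.
            fits = (new_mean - new_lo < max_diff) and (new_hi - new_mean < max_diff)
        else:
            new_lo = new_hi = item
            fits = True  # a single item always forms a valid group
        if fits:
            group.append(item)
            total += item
            lo, hi = new_lo, new_hi
        else:
            # Close the current group and start a new one with this item.
            yield tuple(group)
            group, total, lo, hi = [item], item, item, item
    if group:
        yield tuple(group)

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18,
        3.66, 3.77, 3.59, 3.72]

result = list(ordered_cluster_running(data, 0.5))
print(result)
```

On the example data this produces the same three groups as the output shown above. The difference matters mainly when groups grow long, because the original version rechecks the whole group on every new item.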

Gary van der Merwe answered Dec 07 '25 15:12


