How would you group/cluster these three areas in arrays in python?

Tags: python, cluster-analysis, data-mining, pattern-recognition

So you have an array:

1 2 3 60 70 80 100 220 230 250

For a better understanding: [figure not preserved: the array values plotted against their index]

How would you group/cluster the three areas into arrays in Python (v2.6), so that in this case you get three arrays containing

[1 2 3] [60 70 80 100] [220 230 250]

Background:

The y-axis is frequency, the x-axis is the index. These numbers are the ten highest amplitudes, represented by their frequencies. I want to create three discrete numbers from them for pattern recognition. There could be many more points, but all of them are grouped by a relatively big frequency difference, as you can see in this example between about 0 and about 50, and between about 100 and about 220. Note that what counts as big and what counts as small changes, but the difference between clusters remains significant compared to the difference between elements of a group/cluster.

asked Jan 20 '12 10:01

Zurechtweiser

2 Answers

Observe that your data points are actually one-dimensional if x just represents an index. You can cluster your points using SciPy's cluster.vq module, which implements the k-means algorithm.

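>>> import numpy as np
>>> from scipy.cluster.vq import kmeans, vq
>>> # float dtype: some SciPy versions reject integer observations
>>> y = np.array([1, 2, 3, 60, 70, 80, 100, 220, 230, 250], dtype=float)
>>> codebook, _ = kmeans(y, 3)  # three clusters
>>> cluster_indices, _ = vq(y, codebook)  # assign each point to its nearest centroid
>>> cluster_indices
array([1, 1, 1, 0, 0, 0, 0, 2, 2, 2])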

The result means: the first three points form cluster 1 (an arbitrary label), the next four form cluster 0 and the last three form cluster 2. Because k-means starts from a random initialization, the label numbers can differ between runs. Grouping the original points according to the indices is left as an exercise for the reader.
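One possible way to do that grouping, as a minimal sketch: the helper name group_by_label is made up for illustration, and the labels are copied by hand from the output shown above rather than recomputed, so the snippet runs without SciPy.

```python
def group_by_label(values, labels):
    """Group values by cluster label, keeping clusters in order of first appearance.

    Hypothetical helper, not part of SciPy.
    """
    groups = {}   # label -> list of values carrying that label
    order = []    # labels in the order they are first seen
    for value, label in zip(values, labels):
        label = int(label)  # also accepts NumPy integer labels
        if label not in groups:
            groups[label] = []
            order.append(label)
        groups[label].append(value)
    return [groups[label] for label in order]


y = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
cluster_indices = [1, 1, 1, 0, 0, 0, 0, 2, 2, 2]  # copied from the vq() output above

clusters = group_by_label(y, cluster_indices)
print(clusters)  # [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```

With NumPy arrays, a boolean mask such as y[cluster_indices == 0] selects one cluster at a time; the loop above just avoids depending on the arbitrary label numbers.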

For more clustering algorithms in Python, check out scikit-learn.

answered Oct 14 '22 06:10

Fred Foo

This is a simple algorithm implemented in Python that checks whether or not a value is too far (in terms of standard deviation) from the mean of a cluster:

from math import sqrt

def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum(lst) / n
    # population variance; clamp at 0 to guard against floating-point round-off
    variance = max(0.0, sum(x * x for x in lst) / n - mean * mean)
    return mean, sqrt(variance)

def parse(lst, n):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue

        mean, stdev = stat(cluster)
        if abs(mean - i) > n * stdev:    # check the "distance"
            yield cluster
            cluster = []    # start a new list so the yielded one is not modified

        cluster.append(i)
    yield cluster           # yield the last cluster

This will return what you expect in your example with 5 < n < 9:

>>> array = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
>>> for cluster in parse(array, 7):
...     print(cluster)
[1, 2, 3]
[60, 70, 80, 100]
[220, 230, 250]
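To see how sensitive the split is to the threshold n, here is a small self-contained sketch. It applies the same rule as parse() but returns a list of lists; the name split_by_deviation is made up for illustration, and it relies on the statistics module, which needs Python 3.4 or later rather than the 2.6 mentioned in the question.

```python
from statistics import mean, pstdev


def split_by_deviation(values, n):
    """Hypothetical helper: same rule as parse(), returning a list of lists."""
    clusters, current = [], []
    for v in values:
        # Once the current cluster holds at least two values, a new value opens
        # a fresh cluster when it lies more than n population standard
        # deviations away from the current cluster's mean.
        if len(current) >= 2 and abs(mean(current) - v) > n * pstdev(current):
            clusters.append(current)
            current = []
        current.append(v)
    if current:
        clusters.append(current)
    return clusters


data = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
for n in (4, 7, 20):
    print(n, split_by_deviation(data, n))
```

On this data, n = 4 splits 250 off into its own cluster, n = 7 gives the three expected groups, and n = 20 merges the last two groups, so the threshold has to be tuned to the data.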

answered Oct 14 '22 06:10

Rik Poggi
