So you have an array
1
2
3
60
70
80
100
220
230
250
For a better understanding:
How would you group/cluster the three areas in arrays in python(v2.6), so you get three arrays in this case containing
[1 2 3] [60 70 80 100] [220 230 250]
Background:
y-axis is frequency, x-axis is number. These numbers are the ten highest amplitudes being represented by their frequencies. I want to create three discrete numbers from them for pattern recognition. There could be many more points but all of them are grouped by a relatively big frequency difference as you can see in this example between about 50 and about 0 and between about 100 and about 220. Note that what is big and what is small changes but the difference between clusters remains significant compared to the difference between elements of a group/cluster.
Add a cluster to the front panel. Add the Cluster To Array function to the block diagram. Wire the cluster to the Cluster To Array function. Right-click the Cluster To Array function and select Create»Indicator from the shortcut menu to create an array indicator.
Python offers many useful tools for performing cluster analysis. The best tool to use depends on the problem at hand and the type of data available. There are three widely used techniques for how to form clusters in Python: K-means clustering, Gaussian mixture models and spectral clustering.
Observe that your data points are actually one-dimensional if x
just represents an index. You can cluster your points using Scipy's cluster.vq
module, which implements the k-means algorithm.
>>> import numpy as np
>>> from scipy.cluster.vq import kmeans, vq
>>> y = np.array([1,2,3,60,70,80,100,220,230,250])
>>> codebook, _ = kmeans(y, 3) # three clusters
>>> cluster_indices, _ = vq(y, codebook)
>>> cluster_indices
array([1, 1, 1, 0, 0, 0, 0, 2, 2, 2])
The result means: the first three points form cluster 1
(an arbitrary label), the next four form cluster 0
and the last three form cluster 2
. Grouping the original points according to the indices is left as an exercise for the reader.
For more clustering algorithms in Python, check out scikit-learn.
This is a simple algorithm implemented in python that check whether or not a value is too far (in terms of standard deviation) from the mean of a cluster:
from math import sqrt
def stat(lst):
"""Calculate mean and std deviation from the input list."""
n = float(len(lst))
mean = sum(lst) / n
stdev = sqrt((sum(x*x for x in lst) / n) - (mean * mean))
return mean, stdev
def parse(lst, n):
cluster = []
for i in lst:
if len(cluster) <= 1: # the first two values are going directly in
cluster.append(i)
continue
mean,stdev = stat(cluster)
if abs(mean - i) > n * stdev: # check the "distance"
yield cluster
cluster[:] = [] # reset cluster to the empty list
cluster.append(i)
yield cluster # yield the last cluster
This will return what you expect in your example with 5 < n < 9
:
>>> array = [1, 2, 3, 60, 70, 80, 100, 220, 230, 250]
>>> for cluster in parse(array, 7):
... print(cluster)
[1, 2, 3]
[60, 70, 80, 100]
[220, 230, 250]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With