
Clustering values by their proximity in Python (machine learning?) [duplicate]

I have an algorithm that runs on a set of objects. It produces a score value that measures the differences between the elements in the set.

The sorted output is something like this:

[1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]

If you lay these values out in a spreadsheet, you can see that they form groups:

[1,1,5,6,1,5] [10,22,23,23] [50,51,51,52] [100,112,130] [500,512,600] [12000,12230]

Is there a way to programmatically get those groupings?

Maybe some clustering algorithm using a machine learning library? Or am I overthinking this?

I've looked at scikit-learn, but its examples are way too advanced for my problem...

asked Aug 21 '13 17:08 by PCoelho



1 Answer

Don't use clustering for 1-dimensional data

Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, sort it and look for the largest gaps. This is trivial and fast in 1D, and not possible in 2D, where the data has no natural sort order. If you want something more advanced, use Kernel Density Estimation (KDE) and look for local minima of the density to split the data set.
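The "sort and look for the largest gaps" idea could be sketched roughly as follows. This is only a minimal illustration, not the answerer's code: the function name `split_at_largest_gaps` and the `n_groups` parameter are made up here, and it assumes you already know how many groups you want.

```python
def split_at_largest_gaps(values, n_groups):
    """Sort the values and cut at the (n_groups - 1) widest gaps.

    Hypothetical helper for illustration; n_groups is an assumption,
    the original answer does not say how to choose the number of cuts.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    data = sorted(values)
    if n_groups == 1 or len(data) < 2:
        return [data]

    # Index i stands for the gap between data[i] and data[i + 1].
    gap_indices = sorted(
        range(len(data) - 1),
        key=lambda i: data[i + 1] - data[i],
        reverse=True,
    )
    # Keep the widest gaps, then restore left-to-right order for slicing.
    cuts = sorted(gap_indices[: n_groups - 1])

    groups, start = [], 0
    for i in cuts:
        groups.append(data[start : i + 1])
        start = i + 1
    groups.append(data[start:])
    return groups


scores = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
          100, 112, 130, 500, 512, 600, 12000, 12230]
for group in split_at_largest_gaps(scores, 3):
    print(group)
```

With absolute gaps, the two widest gaps in the question's data are 600 to 12000 and 130 to 500, so asking for three groups yields everything up to 130, then [500, 512, 600], then [12000, 12230]. The finer groups in the question look closer to a relative scale, so comparing gaps between logarithms of the values may be worth trying; that is a guess about the intent, not something the answer states.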

There are a number of duplicates of this question:

  • 1D Number Array Clustering
  • Cluster one-dimensional data optimally?
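
The KDE suggestion in the answer can be sketched in plain Python as well. This is a rough illustration under stated assumptions: the names `kde_split_points` and `split_by_kde` are hypothetical, the kernel is Gaussian, the density is only evaluated on a fixed grid, and the `bandwidth` has to be chosen by hand (it controls how many groups you get).

```python
import math


def kde_split_points(values, bandwidth, grid_size=512):
    """Return grid positions where a Gaussian KDE has a local minimum.

    Hypothetical helper; bandwidth and grid_size are assumptions that
    need tuning for real data.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    data = sorted(values)
    if not data:
        return []
    lo, hi = data[0], data[-1]
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]

    def density(x):
        # Unnormalised sum of Gaussian kernels; the scale does not
        # affect where the minima are.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data)

    ys = [density(x) for x in xs]
    # Strict drop on the left and no drop on the right, so a flat
    # valley is reported only once.
    return [
        xs[i]
        for i in range(1, grid_size - 1)
        if ys[i] < ys[i - 1] and ys[i] <= ys[i + 1]
    ]


def split_by_kde(values, bandwidth, grid_size=512):
    """Group sorted values using the KDE minima as cut points."""
    data = sorted(values)
    cuts = kde_split_points(data, bandwidth, grid_size)
    groups, cut_index = [[]], 0
    for v in data:
        # Start a new group each time a value passes the next cut.
        while cut_index < len(cuts) and v > cuts[cut_index]:
            groups.append([])
            cut_index += 1
        groups[-1].append(v)
    return [g for g in groups if g]


print(split_by_kde([1.0, 1.2, 0.8, 10.0, 10.5, 9.8], bandwidth=1.0))
```

For larger data sets, a library estimator such as `scipy.stats.gaussian_kde` or `sklearn.neighbors.KernelDensity` is probably a better fit than this hand-rolled loop. For data spanning several orders of magnitude, like the scores in the question, a single bandwidth on the raw values is unlikely to work well; estimating the density on the logarithms of the values is one option to try.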
answered Sep 24 '22 03:09 by Has QUIT--Anony-Mousse