
How to cluster by trend instead of by distance in R?

The k-medoids algorithm in the clara() function forms clusters by distance, so I get this pattern:

library(cluster)  # provides clara()
a <- matrix(c(0,1,3,2,0,.32,1,.5,0,.35,1.2,.4,.5,.3,.2,.1,.5,.2,0,-.1), byrow=TRUE, nrow=5)
cl <- clara(a,2)
matplot(t(a),type="b", pch=20, col=cl$clustering)

clustering by clara()

But I want to find a clustering method that assigns a cluster to each line according to its trend, so lines 1, 2 and 3 belong to one cluster and lines 4 and 5 to another.

nachocab asked May 11 '12 17:05



2 Answers

This question might be better suited to stats.stackexchange.com, but here's a solution anyway.

Your question is actually "How do I pick the right distance metric?". Instead of Euclidean distance between these vectors, you want a distance that measures similarity in trend.

Here's one option:

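a1 <- t(apply(a,1,scale))   # standardize each row (mean 0, sd 1)
a2 <- t(apply(a1,1,diff))   # successive differences within each row

cl <- clara(a2,2)           # cluster on the transformed data
matplot(t(a),type="b", pch=20, col=cl$clustering)  # plot the original curves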


Instead of defining a new distance metric, I've accomplished essentially the same thing by transforming the data. First, I scale each row so that we can compare relative trends without differences in scale throwing us off. Then I convert the data to successive differences.

Warning: This is not necessarily going to work for all "trend" data. In particular, looking at successive differences only captures a single, limited facet of "trend". You may have to put some thought into more sophisticated metrics.
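As one illustration of a different metric, here is a minimal sketch that uses a correlation-based distance (1 minus the Pearson correlation between rows). This is my own assumption of what a "trend" metric could look like, not something from the code above, and it swaps clara() for base R's hclust() and cutree() so that it runs without extra packages. Correlation only compares the shape of the curves and ignores how steep they are, so it may or may not suit your data.

```r
# Same example matrix as in the question: one curve per row.
a <- matrix(c(0,1,3,2,0,.32,1,.5,0,.35,1.2,.4,.5,.3,.2,.1,.5,.2,0,-.1),
            byrow = TRUE, nrow = 5)

# cor() works on columns, so transpose to get row-by-row correlations.
# Assumed distance: 0 for identically shaped curves, up to 2 for opposite ones.
d <- as.dist(1 - cor(t(a)))

# Hierarchical clustering on that dissimilarity, cut into two groups.
groups <- cutree(hclust(d), k = 2)

matplot(t(a), type = "b", pch = 20, col = groups)
```

On this small example the rising curves (rows 1 to 3) end up in one group and the falling curves (rows 4 and 5) in the other. Note that a constant row has zero variance, so cor() returns NA for it and such rows would need separate handling.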

joran answered Sep 28 '22 02:09


Do more preprocessing. In any data mining task, preprocessing is 90% of the effort.

For example, if you want to cluster by trends, then you should probably apply the clustering to the trends rather than to the raw values. For instance, standardize each curve to a mean of 0 and a standard deviation of 1, then compute the differences from one value to the next, and apply the clustering to this preprocessed data.
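The steps above could be wrapped up roughly like this. It is only a sketch: trend_features() is a hypothetical helper name, the guard for flat curves is my own addition, and I use base R's hclust() here simply to keep the example free of extra packages; any clustering function that accepts a numeric matrix or a distance object should work in its place.

```r
# Hypothetical helper: standardize each row, then take successive differences.
trend_features <- function(m) {
  z <- t(apply(m, 1, function(x) {
    s <- sd(x)
    # A flat curve has sd 0; center it only, so it becomes all zeros
    # instead of NaN.
    if (s == 0) x - mean(x) else (x - mean(x)) / s
  }))
  t(apply(z, 1, diff))
}

# Example matrix from the question: one curve per row.
a <- matrix(c(0,1,3,2,0,.32,1,.5,0,.35,1.2,.4,.5,.3,.2,.1,.5,.2,0,-.1),
            byrow = TRUE, nrow = 5)

feats  <- trend_features(a)                    # 5 rows, 3 difference columns
groups <- cutree(hclust(dist(feats)), k = 2)   # cluster the preprocessed data

matplot(t(a), type = "b", pch = 20, col = groups)
```

The clustering runs on feats, while the plot still shows the original curves, colored by the group each one was assigned to.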

Has QUIT--Anony-Mousse answered Sep 28 '22 00:09