The k-medoids algorithm in the clara() function uses distance to form clusters, so I get this pattern:
library(cluster)  # clara() lives in the cluster package
a <- matrix(c(0, 1, 3, 2, 0, .32, 1, .5, 0, .35, 1.2, .4, .5, .3, .2, .1, .5, .2, 0, -.1), byrow = TRUE, nrow = 5)
cl <- clara(a, 2)
matplot(t(a), type = "b", pch = 20, col = cl$clustering)  # one line per row of a, colored by cluster
But I want to find a clustering method that assigns a cluster to each line according to its trend, so lines 1, 2 and 3 belong to one cluster and lines 4 and 5 to another.
Distance-based methods optimize a global criterion based on the distances between patterns; k-means, CLARA, and CLARANS are examples of distance-based clustering methods. Density-based methods optimize local criteria based on density information of the patterns.
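For contrast, here is a minimal sketch of the density-based side, assuming the dbscan package is installed and reusing the toy matrix a from the question; the eps and minPts values are illustrative and would need tuning on real data:

library(dbscan)                          # an assumed add-on package, not part of the original answer
db <- dbscan(a, eps = 0.5, minPts = 2)   # points in sparse regions are labeled noise (cluster 0)
db$cluster                               # cluster membership per row of a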
For time series clustering in R, the first step is to work out an appropriate distance/similarity metric; the second step is to apply an existing clustering technique, such as k-means, hierarchical clustering, density-based clustering or subspace clustering, to find the clustering structure.
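As a sketch of that two-step recipe (again assuming the toy matrix a from the question), one option is a correlation-based dissimilarity, so that series with the same shape end up close together, handed to plain hierarchical clustering:

d <- as.dist(1 - cor(t(a)))          # step 1: rows of a as series; distance 0 means identical trend
hc <- hclust(d, method = "average")  # step 2: any standard clusterer will do
cutree(hc, k = 2)                    # cluster labels for the five lines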
To answer that question, we need to consider the algorithm, the data we are using, and the application being built. Taking all of that into account, you will be able to choose a clustering algorithm that makes your analysis fast and efficient. As always in data science, it comes down to the data.
This question might be better suited to stats.stackexchange.com, but here's a solution anyway.
Your question is actually "How do I pick the right distance metric?". Instead of Euclidean distance between these vectors, you want a distance that measures similarity in trend.
Here's one option:
a1 <- t(apply(a, 1, scale))   # standardize each row: mean 0, sd 1
a2 <- t(apply(a1, 1, diff))   # successive differences capture the trend
cl <- clara(a2, 2)            # cluster on the transformed data
matplot(t(a), type = "b", pch = 20, col = cl$clustering)  # plot the original data, colored by cluster
Instead of defining a new distance metric, I've accomplished essentially the same thing by transforming the data. First, each row is scaled so that relative trends can be compared without differences in scale throwing us off. Then the scaled values are converted to successive differences.
Warning: This is not necessarily going to work for all "trend" data. In particular, looking at successive differences only captures a single, limited facet of "trend". You may have to put some thought into more sophisticated metrics.
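One more sophisticated direction, sketched under the assumption that the dtw package is installed: use dynamic time warping as the distance and cluster on that.

library(dtw)                         # loading dtw registers the "DTW" method with the proxy package
d <- proxy::dist(a, method = "DTW")  # pairwise DTW distances between the rows of a
cutree(hclust(d), k = 2)             # hierarchical clustering on the DTW distances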
Do more preprocessing. In any data mining task, preprocessing is 90% of the effort.
For example, if you want to cluster by trend, you should apply the clustering to the trends, not to the raw values. So, for example, standardize each curve to a mean of 0 and a standard deviation of 1, compute the differences from one value to the next, and then apply the clustering to this preprocessed data!
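A compact base-R sketch of exactly that recipe, this time fed to kmeans (the choice of kmeans here is an assumption for illustration, not part of this answer):

z <- t(apply(a, 1, scale))          # standardize each curve: mean 0, sd 1
trend <- t(apply(z, 1, diff))       # value-to-value differences, i.e. the trend
kmeans(trend, centers = 2)$cluster  # cluster labels from the preprocessed data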