The k-medoids algorithm in the clara() function uses distance to form clusters, so I get this pattern:
library(cluster)  # clara() lives in the cluster package
a <- matrix(c(0, 1, 3, 2, 0, .32, 1, .5, 0, .35, 1.2, .4, .5, .3, .2, .1, .5, .2, 0, -.1), byrow = TRUE, nrow = 5)
cl <- clara(a, 2)
matplot(t(a), type = "b", pch = 20, col = cl$clustering)  # one line per row of a, colored by cluster
But I want to find a clustering method that assigns a cluster to each line according to its trend, so lines 1, 2 and 3 belong to one cluster and lines 4 and 5 to another.
Distance-based methods optimize a global criterion based on the distances between patterns; k-means, CLARA, and CLARANS are examples of distance-based clustering methods. Density-based methods optimize local criteria based on density information of the patterns.
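For contrast, here is a minimal sketch of the density-based side, assuming the dbscan package is installed and reusing the toy matrix a from the question; the eps and minPts values are illustrative and would need tuning on real data:

library(dbscan)                          # an assumed add-on package, not part of the original answer
db <- dbscan(a, eps = 0.5, minPts = 2)   # points in sparse regions are labeled noise (cluster 0)
db$cluster                               # cluster membership per row of a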
For time series clustering in R, the first step is to work out an appropriate distance/similarity metric; the second step is to apply an existing clustering technique, such as k-means, hierarchical clustering, density-based clustering or subspace clustering, to find the clustering structure.
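As a sketch of that two-step recipe (again assuming the toy matrix a from the question), one option is a correlation-based dissimilarity, so that series with the same shape end up close together, handed to plain hierarchical clustering:

d <- as.dist(1 - cor(t(a)))          # step 1: rows of a as series; distance 0 means identical trend
hc <- hclust(d, method = "average")  # step 2: any standard clusterer will do
cutree(hc, k = 2)                    # cluster labels for the five lines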
To answer that question, we need to consider the algorithm, the data we are using, and the application being built. Taking all of that into account, you will be able to choose a clustering algorithm that makes your analysis fast and efficient. As always in data science, it comes down to the data.
This question might be better suited to stats.stackexchange.com, but here's a solution anyway.
Your question is actually "How do I pick the right distance metric?". Instead of Euclidean distance between these vectors, you want a distance that measures similarity in trend.
Here's one option:
a1 <- t(apply(a, 1, scale))   # standardize each row: mean 0, sd 1
a2 <- t(apply(a1, 1, diff))   # successive differences capture the trend
cl <- clara(a2, 2)            # cluster on the transformed data
matplot(t(a), type = "b", pch = 20, col = cl$clustering)  # plot the original data, colored by cluster
Instead of defining a new distance metric, I've accomplished essentially the same thing by transforming the data. First, each row is scaled so that relative trends can be compared without differences in scale throwing us off. Then the scaled values are converted to successive differences.
Warning: This is not necessarily going to work for all "trend" data. In particular, looking at successive differences only captures a single, limited facet of "trend". You may have to put some thought into more sophisticated metrics.
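One more sophisticated direction, sketched under the assumption that the dtw package is installed: use dynamic time warping as the distance and cluster on that.

library(dtw)                         # loading dtw registers the "DTW" method with the proxy package
d <- proxy::dist(a, method = "DTW")  # pairwise DTW distances between the rows of a
cutree(hclust(d), k = 2)             # hierarchical clustering on the DTW distances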
Do more preprocessing. In any data mining task, preprocessing is 90% of the effort.
For example, if you want to cluster by trend, you should apply the clustering to the trends, not to the raw values. So, for example, standardize each curve to a mean of 0 and a standard deviation of 1, compute the differences from one value to the next, and then apply the clustering to this preprocessed data!
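A compact base-R sketch of exactly that recipe, this time fed to kmeans (the choice of kmeans here is an assumption for illustration, not part of this answer):

z <- t(apply(a, 1, scale))          # standardize each curve: mean 0, sd 1
trend <- t(apply(z, 1, diff))       # value-to-value differences, i.e. the trend
kmeans(trend, centers = 2)$cluster  # cluster labels from the preprocessed data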