
R kmeans (stats) vs Kmeans (amap)

Tags: r, k-means

Hello stackoverflow community,

I'm running kmeans (stats package) and Kmeans (amap package) on the Iris dataset. In both cases, I use the same algorithm (Lloyd–Forgy), the same distance (euclidean), the same number of initial random sets (50), the same maximal number of iterations (1000), and I test for the same set of k values (from 2 to 15). I also use the same seed for both cases (4358).

I don't understand why under these conditions I'm getting different wss curves, in particular: the "elbow" using the stats package is much less accentuated than when using the amap package.

Could you please help me to understand why? Thanks much!

Here is the code:

# data load and scaling
newiris <- iris
newiris$Species <- NULL
newiris <- scale(newiris)

# using kmeans (stats)
wss1 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
  set.seed(4358)
  wss1[i] <- sum(kmeans(newiris, centers=i, iter.max=1000, nstart=50,
                       algorithm="Lloyd")$withinss)
  }

# using Kmeans (amap)
library(amap)
wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))
for (i in 2:15) {
  set.seed(4358)
  wss2[i] <- sum(Kmeans(newiris, centers=i, iter.max=1000, nstart=50,
                       method="euclidean")$withinss)
  }

# plots
plot(1:15, wss1, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main="kmeans (stats package)")
plot(1:15, wss2, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main="Kmeans (amap package)")

EDIT: I've emailed the author of the amap package and will post the reply when/if I get any. https://cran.r-project.org/web/packages/amap/index.html

pim asked Sep 07 '15 13:09



1 Answer

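The author of the amap package changed the code, and the withinss value is now the sum of the distances as computed by the chosen method (e.g. Euclidean distance), whereas kmeans (stats) reports the within-cluster sum of squared Euclidean distances.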
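To see how much the two kinds of sum can differ, here is a minimal sketch using only kmeans (stats) on the scaled iris data. It assumes, as described above, that the amap value is a sum of plain distances; the names d, ss_squared and ss_plain are made up for this illustration and do not come from either package.

```r
# Scaled iris measurements, as in the question
x <- scale(iris[, 1:4])

set.seed(4358)
fit <- kmeans(x, centers = 3, iter.max = 1000, nstart = 50, algorithm = "Lloyd")

# Euclidean distance from each observation to its assigned cluster centre
d <- sqrt(rowSums((x - fit$centers[fit$cluster, ])^2))

# Per-cluster sum of squared distances: what kmeans (stats) reports as withinss
ss_squared <- as.vector(tapply(d^2, fit$cluster, sum))

# Per-cluster sum of plain distances (assumed here to mimic the amap behaviour)
ss_plain <- as.vector(tapply(d, fit$cluster, sum))

print(rbind(stats_withinss = fit$withinss,
            ss_squared     = ss_squared,
            ss_plain       = ss_plain))
```

The first two rows should agree, while the third is a different quantity on a different scale, so an elbow plot built from it will not have the same shape.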

One way to solve this is to take the result returned by Kmeans (amap) and recalculate withinss as the error sum of squares (SSE) from the cluster assignments.

Here is my suggestion:

    # using Kmeans (amap)
    library(amap)

    wss2 <- (nrow(newiris)-1)*sum(apply(newiris,2,var))

    for (i in 2:15) {
        set.seed(4358)
        ans.Kmeans <- Kmeans(newiris, centers=i, iter.max=1000, nstart=50, method="euclidean")

        # per-cluster error sum of squares, recomputed from the assignments
        wss <- vector(mode = "numeric", length=i)
        for (j in 1:i) {
            # drop=FALSE keeps a one-row cluster as a matrix
            km <- newiris[ans.Kmeans$cluster == j, , drop=FALSE]

            # squared deviations from the cluster's column means; for clusters
            # with at least two rows this equals (nrow(km)-1) * sum(apply(km,2,var))
            wss[j] <- sum(sweep(km, 2, colMeans(km))^2)
        }

        wss2[i] <- sum(wss)
    }
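As a sanity check on the recalculation itself, the sketch below applies the same per-cluster sum of squares to the cluster labels returned by kmeans (stats) and compares it with tot.withinss. It does not need amap; recompute_wss is a hypothetical helper written for this example, not a function from either package.

```r
# Hypothetical helper: total within-cluster sum of squares from cluster labels
recompute_wss <- function(x, cluster) {
  x <- as.matrix(x)
  per_cluster <- vapply(sort(unique(cluster)), function(j) {
    km <- x[cluster == j, , drop = FALSE]
    # squared deviations of each row from the cluster's column means
    sum(sweep(km, 2, colMeans(km))^2)
  }, numeric(1))
  sum(per_cluster)
}

x <- scale(iris[, 1:4])

for (k in 2:6) {
  set.seed(4358)
  fit <- kmeans(x, centers = k, iter.max = 1000, nstart = 50, algorithm = "Lloyd")
  # the recomputed value should match what kmeans (stats) reports
  stopifnot(isTRUE(all.equal(recompute_wss(x, fit$cluster), fit$tot.withinss)))
}
```

If the check passes, the same helper can be given ans.Kmeans$cluster from amap to obtain values comparable with wss1.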

Note: the pearson method in this package is wrong in version 0.8-14 (be careful!).

See line 325 of the code at this link:

https://github.com/cran/amap/blob/master/src/distance_T.inl

Ricardo Jacomini answered Sep 20 '22 22:09
