 

K-means: Initial centers are not distinct

I am using the GA package and my aim is to find optimal initial centroid positions for the k-means clustering algorithm. My data is a sparse matrix of TF-IDF word scores and is downloadable here. Below are the stages I have implemented:

0. Libraries and dataset

library(clusterSim)           ## for index.DB()
library(GA)                   ## for ga() 

corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv")     ## a dataset of 5000 x 1168

1. Binary encoding and generating the initial population.

k_min <- 15

initial_population <- function(object) {
    ## generate a population in which each individual turns on
    ## exactly k_min of the nBits cluster bits
    init <- t(replicate(object@popSize,
                        sample(rep(c(1, 0), c(k_min, object@nBits - k_min))),
                        simplify = TRUE))
    return(init)
}
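As a quick sanity check, the function's body can be exercised outside ga() with a mock object (the popSize and nBits values below are illustrative stand-ins for the slots the GA package would pass in):

```r
## Sanity-check sketch: every individual should have exactly k_min bits on.
k_min <- 15
mock  <- list(popSize = 10, nBits = 100)   # stand-in for the ga() object
init  <- t(replicate(mock$popSize,
                     sample(rep(c(1, 0), c(k_min, mock$nBits - k_min)))))
stopifnot(all(dim(init) == c(mock$popSize, mock$nBits)),
          all(rowSums(init) == k_min))     # 15 centroids selected per row
```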

2. Fitness function. It minimizes the Davies-Bouldin (DB) index; I evaluate the DBI for each solution generated by initial_population.

DBI2 <- function(x) {
    ## x is a vector of solution of nBits 
    ## exclude first column of corpus
    initial_centroid <- corpus[x==1, -1]
    cl <- kmeans(corpus[-1], initial_centroid)
    dbi <- index.DB(corpus[-1], cl=cl$cluster, centrotypes = "centroids")
    score <- -dbi$DB
    return(score) 
}

3. Running the GA with these settings.

g2<- ga(type = "binary", 
    fitness = DBI2, 
    population = initial_population,
    selection = ga_rwSelection,
    crossover = gabin_spCrossover,
    pcrossover = 0.8,
    pmutation = 0.1,
    popSize = 100, 
    nBits = nrow(corpus),
    seed = 123)

4. The problem. Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct.

I found a similar problem here, where the user also had to pass in the number of clusters dynamically. It was solved by hard-coding the number of clusters. In my case, however, I really do need to pass the number of clusters in dynamically, since it comes from a randomly generated binary vector whose 1s mark the initial centroids.

Checking with the kmeans() code, I noticed that the error is caused by duplicated centers:

if(any(duplicated(centers)))
        stop("initial centers are not distinct")

I edited the kmeans function with trace() to print out the duplicated centers. The output:

 [1] "206"  "520"  "564"  "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"

This shows no duplication among the randomly selected initial centroids, and I have no idea why the error keeps occurring. Is there anything else that could lead to this error?
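One thing worth ruling out is the data itself: duplicated() on the feature columns compares cell values only, so two rows that differ solely in row.names still count as duplicates. A diagnostic sketch on a toy stand-in for the corpus (first column a document id, as above):

```r
## Toy stand-in: rows d1 and d3 have identical TF-IDF values but
## different document ids / row names.
corpus <- data.frame(doc = c("d1", "d2", "d3"),
                     w1  = c(0.1, 0.4, 0.1),
                     w2  = c(0.7, 0.2, 0.7))
dup_rows <- duplicated(corpus[-1])  # ignores the id column and row names
sum(dup_rows)                       # > 0 means kmeans can be seeded with repeats
which(dup_rows)                     # positions of the offending rows
```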

P.S.: I understand some may suggest that GA + k-means is not a good idea, but I hope to finish what I have started. It is better to view this as a k-means problem (at least as far as solving the "initial centers are not distinct" error is concerned).

jacky_learns_to_code asked Feb 15 '17


2 Answers

Genetic algorithms are not well suited to optimizing k-means by the nature of the problem: initialization seeds interact too much, and a GA will not do better than taking a random sample of all possible seeds.

So my main advice is not to use genetic algorithms at all here!

If you insist, what you need to do is detect the bad parameters and simply return a bad score for bad initializations, so they don't "survive".
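One way to implement that guard, building on the asker's DBI2 (the name DBI2_safe and the -Inf sentinel are my choices; any sufficiently bad score works):

```r
DBI2_safe <- function(x) {
    initial_centroid <- corpus[x == 1, -1]
    ## Reject invalid individuals before kmeans() can stop() on them:
    ## duplicated centers are exactly what triggers
    ## "initial centers are not distinct".
    if (any(duplicated(initial_centroid))) return(-Inf)
    cl  <- kmeans(corpus[-1], initial_centroid)
    dbi <- index.DB(corpus[-1], cl = cl$cluster, centrotypes = "centroids")
    -dbi$DB   # ga() maximizes, so negate the index
}
```

An individual selecting two identical rows now scores -Inf and is bred out, while valid individuals are evaluated exactly as before.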

Has QUIT--Anony-Mousse answered Nov 15 '22


To answer your question, just run:

any(corpus[520, -1] != corpus[564, -1])

Rows 520 and 564 of your corpus are the same; the only difference is in the row.names attribute, see:

identical(colnames(corpus[520, -1]), colnames(corpus[564, -1])) # just to be sure
rownames(corpus[520, -1])
rownames(corpus[564, -1])
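If those repeats are genuine duplicates, one option (my suggestion, shown on a toy stand-in for the corpus) is to drop them before running the GA, so kmeans() can never be seeded with two identical centers:

```r
## Toy stand-in: d1 and d2 carry identical TF-IDF values.
corpus <- data.frame(doc = c("d1", "d2", "d3"),
                     w1  = c(0.1, 0.1, 0.4),
                     w2  = c(0.7, 0.7, 0.2))
corpus_u <- corpus[!duplicated(corpus[-1]), ]  # keep the first of each repeat
nrow(corpus_u)                                 # one fewer row than corpus
```

Note that nBits in the ga() call must then be nrow(corpus_u) rather than nrow(corpus).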

Regarding the GA and k-means, see e.g.:

  1. Bashar Al-Shboul, Sung-Hyon Myaeng, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering & Technology, Jun 2009, Issue 30, p. 114 (especially section II B); or
  2. Bain Khusul Khotimah, Firli Irhamni, and Tri Sundarwati, "A Genetic Algorithm for Optimized Initial Centers K-Means Clustering in SMEs", Journal of Theoretical and Applied Information Technology, 2016, Vol. 90, No. 1.
m-dz answered Nov 15 '22