K-means: Initial centers are not distinct

Tags: optimization, r, k-means, genetic-algorithm, sparse-matrix

I am using the GA package, and my aim is to find the optimal initial centroid positions for the k-means clustering algorithm. My data is a sparse matrix of TF-IDF word scores and is downloadable here. Below are the stages I have implemented:

0. Libraries and dataset

library(clusterSim)           ## for index.DB()
library(GA)                   ## for ga() 

corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv")     ## a dataset of 5000 x 1168

1. Binary encoding and generating the initial population.

k_min <- 15

initial_population <- function(object) {
    ## generate a population in which each individual has exactly k_min (15) bits switched on
    ones_and_zeros <- rep(c(1, 0), c(k_min, object@nBits - k_min))
    init <- t(replicate(object@popSize, sample(ones_and_zeros)))
    return(init)
}
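
As a quick sanity check, the sampling step can be run outside ga() to confirm that every individual starts with exactly k_min bits set. This is only a sketch: n_bits and pop_size are small stand-ins for object@nBits and object@popSize, chosen here purely for illustration.

```r
k_min    <- 3    # bits switched on per individual (15 in the question)
n_bits   <- 10   # stand-in for object@nBits
pop_size <- 5    # stand-in for object@popSize

set.seed(42)
# Each replicate shuffles a vector holding k_min ones and n_bits - k_min zeros
pop <- t(replicate(pop_size, sample(rep(c(1, 0), c(k_min, n_bits - k_min)))))

dim(pop)       # pop_size rows, n_bits columns
rowSums(pop)   # every entry equals k_min
```

Note that only the initial population is constrained this way; crossover and mutation can change the number of set bits in later generations.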

2. Fitness function that minimizes the Davies-Bouldin (DB) index. I evaluate the DBI for each solution generated from initial_population.

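DBI2 <- function(x) {
    ## x is a binary solution vector of length nBits; a 1 marks a row used as an initial center
    ## corpus[-1] drops the first column of corpus
    initial_centroid <- corpus[x == 1, -1]
    cl <- kmeans(corpus[-1], initial_centroid)
    dbi <- index.DB(corpus[-1], cl = cl$cluster, centrotypes = "centroids")
    ## ga() maximizes fitness, so negate the DB index in order to minimize it
    score <- -dbi$DB
    return(score)
}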
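
For reference, the quantity being minimized can be sketched in base R. The helper below, db_index, is a hypothetical name and follows the usual textbook Davies-Bouldin definition with Euclidean distances (mean distance to the centroid as scatter); it is not clusterSim's code, and index.DB has its own options, so its values need not match exactly.

```r
# Hypothetical helper: Davies-Bouldin index from scratch (lower is better)
db_index <- function(data, cluster) {
    data <- as.matrix(data)
    ids  <- sort(unique(cluster))
    # One centroid per cluster, stacked as rows
    cent <- do.call(rbind, lapply(ids, function(i)
        colMeans(data[cluster == i, , drop = FALSE])))
    # Scatter: mean Euclidean distance from members to their centroid
    scatter <- sapply(seq_along(ids), function(i) {
        pts <- data[cluster == ids[i], , drop = FALSE]
        mean(sqrt(rowSums(sweep(pts, 2, cent[i, ])^2)))
    })
    sep   <- as.matrix(dist(cent))              # centroid-to-centroid distances
    ratio <- outer(scatter, scatter, "+") / sep
    diag(ratio) <- -Inf                         # ignore the i == j pairs
    mean(apply(ratio, 1, max))                  # average worst-case ratio
}

toy_points <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
db_good <- db_index(toy_points, c(1, 1, 2, 2))  # tight, well-separated clusters
db_bad  <- db_index(toy_points, c(1, 2, 1, 2))  # stretched, overlapping clusters
```

A compact, well-separated labeling gives a small index and a poor labeling a large one, which is why the fitness function returns the negated value.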

3. Running the GA with these settings.

g2 <- ga(type = "binary", 
    fitness = DBI2, 
    population = initial_population,
    selection = ga_rwSelection,
    crossover = gabin_spCrossover,
    pcrossover = 0.8,
    pmutation = 0.1,
    popSize = 100, 
    nBits = nrow(corpus),
    seed = 123)

4. The problem:

Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct

I found a similar problem here, where the user also had to use a parameter to dynamically pass in the number of clusters. It was solved by hard-coding the number of clusters. In my case, however, I really need to pass in the number of clusters dynamically, since it comes from a randomly generated binary vector in which the 1's represent the initial centroids.

Checking with the kmeans() code, I noticed that the error is caused by duplicated centers:

if(any(duplicated(centers)))
        stop("initial centers are not distinct")

I edited the kmeans function with trace to print out the duplicated centers. The output:

 [1] "206"  "520"  "564"  "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"

This shows no duplication in the randomly selected initial_centroid, and I have no idea why this error keeps occurring. Is there anything else that would lead to this error?

P.S. I do understand that some may suggest GA + k-means is not a good idea, but I do hope to finish what I have started. It is better to view this as a k-means problem (at least as far as solving the "initial centers are not distinct" error goes).


asked Feb 15 '17 13:02

jacky_learns_to_code

2 Answers

Genetic algorithms are not well suited to optimizing k-means, by the nature of the problem: the initialization seeds interact too much, so a GA will not do better than taking a random sample of all possible seeds.

So my main advice is to not use genetic algorithms at all here!

If you insist, what you would need to do is detect the bad parameters and simply return a bad score for any bad initialization, so that those candidates don't "survive".
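
A minimal sketch of that idea, under some loud assumptions: guarded_fitness and its penalty value are hypothetical, the toy matrix stands in for corpus[-1], and the negated total within-cluster sum of squares is used as a stand-in score so the example runs without clusterSim (the question's version would return -index.DB(...)$DB instead).

```r
# Toy data standing in for corpus[-1]; rows 2 and 5 are deliberate duplicates
set.seed(1)
toy <- matrix(runif(40), nrow = 10)
toy[5, ] <- toy[2, ]

# Hypothetical guarded fitness: returns a penalty instead of raising an error
guarded_fitness <- function(x, data, penalty = -1e6) {
    centers <- data[x == 1, , drop = FALSE]
    # kmeans() stops on duplicated centers, so detect that case up front
    if (nrow(centers) < 2 || any(duplicated(centers))) return(penalty)
    cl <- kmeans(data, centers)
    -cl$tot.withinss   # stand-in score; assumption, not the DB index
}

bad  <- as.integer(seq_len(10) %in% c(2, 5, 7))   # picks both duplicate rows
good <- as.integer(seq_len(10) %in% c(1, 2, 7))   # three distinct rows

score_bad  <- guarded_fitness(bad, toy)    # equals the penalty
score_good <- guarded_fitness(good, toy)   # finite, negative score
```

Because ga() maximizes fitness, a very low penalty should make such individuals unlikely to be selected, though the right penalty size depends on the scale of the real score.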


answered Nov 15 '22 08:11

Has QUIT--Anony-Mousse

To answer your question, just run:

any(corpus[520, -1] != corpus[564, -1])

Rows 520 and 564 of corpus are identical (the check above returns FALSE); the only difference is in the row.names attribute. See:

identical(colnames(corpus[520, -1]), colnames(corpus[564, -1])) # just to be sure
rownames(corpus[520, -1])
rownames(corpus[564, -1])
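
The same situation can be reproduced on a small scale. In the sketch below, toy is an invented stand-in for corpus (with word_id playing the role of the first column); it shows that distinct row names do not help when the values match, and one possible way to list or drop exact duplicate rows before running the GA.

```r
# Toy stand-in for corpus: rows "520" and "564" hold identical values
toy <- data.frame(word_id = 1:4, a = c(0, 0, 5, 6), b = c(1, 1, 5, 4),
                  row.names = c("520", "564", "10", "11"))
features <- toy[-1]

# Distinct row names, identical values: kmeans() still rejects the centers
msg <- tryCatch(kmeans(features, features[c("520", "564"), ]),
                error = function(e) conditionMessage(e))
msg

# Flag every row that has an exact copy elsewhere in the data
dup <- duplicated(features) | duplicated(features, fromLast = TRUE)
rownames(features)[dup]

# Keep a single copy of each distinct row
unique_features <- features[!duplicated(features), ]
nrow(unique_features)
```

Whether dropping duplicates is appropriate depends on the data; with sparse TF-IDF rows, exact duplicates (for example, all-zero rows) may be fairly common.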

Regarding the GA and k-means, see e.g.:

Bashar Al-Shboul, Myaeng Sung-Hyon, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering & Technology, Jun 2009, Issue 30, p. 114 (especially section II B); or
Bain Khusul Khotimah, Firli Irhamni, and Tri Sundarwati, "A Genetic Algorithm for Optimized Initial Centers K-Means Clustering in SMEs", Journal of Theoretical and Applied Information Technology, 2016, Vol. 90, No. 1

answered Nov 15 '22 09:11

m-dz
