Kmeans on a million observations in R - trouble plotting clusters

Question

I am trying to perform k-means clustering on over a million rows with 4 variables, all numeric. I am using the following code:

kmeansdf<-as.data.frame(rbind(train$V3,train$V5,train$V8,train$length))
km<-kmeans(kmeansdf,2)

As can be seen, I would like to divide my data into two clusters. The object km gets populated, but I am having trouble plotting the results. Here is the code I am using to plot:

plot(kmeansdf,col=km$cluster)

This piece of code gives me the following error:

Error in plot.new() : figure margins too large

I tried researching online but could not find a solution. I also tried running this from the command line, but I still get the same error (I am using RStudio at the moment).

Any help to resolve the error would be highly appreciated. TIA.

jlhoward · Accepted Answer

When I run your code on a df with 1e6 rows, I don't get the same error, but the system hangs (I interrupted it after 10 min). It may be that creating a scatterplot matrix with 1e6 points per panel is just too much.
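One possible contributor to the original error, offered as an assumption since the question's train data isn't available: rbind() stacks its arguments as rows, so binding four length-n vectors gives a 4 x n object rather than n x 4. plot() on a data frame with that many columns would then try to draw a scatterplot matrix with one panel per pair of columns. A small sketch with made-up vectors (v3, v5, v8 and len are hypothetical stand-ins for the train columns) shows the difference in shape:

```r
# Hypothetical stand-ins for train$V3, train$V5, train$V8, train$length
set.seed(1)
n   <- 1000
v3  <- rnorm(n); v5 <- rnorm(n); v8 <- rnorm(n); len <- rnorm(n)

by.row <- as.data.frame(rbind(v3, v5, v8, len))  # vectors become rows
by.col <- as.data.frame(cbind(v3, v5, v8, len))  # vectors become columns

dim(by.row)   # 4 rows, n columns
dim(by.col)   # n rows, 4 columns

km <- kmeans(by.col, 2)   # one cluster label per observation
length(km$cluster) == nrow(by.col)
```

If the data really is laid out this way, cbind() (or simply train[, c("V3","V5","V8","length")]) may be what was intended, since kmeans() treats each row as one observation.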

You might consider taking a random sample:

# all this to create a df with two distinct clusters
set.seed(1)
center.1 <- c(2,2,2,2)
center.2 <- c(-2,-2,-2,-2)
n <- 5e5
f <- function(x){return(data.frame(V1=rnorm(n,mean=x[1]),
                                   V2=rnorm(n,mean=x[2]),
                                   V3=rnorm(n,mean=x[3]),
                                   V4=rnorm(n,mean=x[4])))}
df <- do.call("rbind",lapply(list(center.1,center.2),f))

km <- kmeans(df,2)         # run kmeans on full dataset
df$cluster <- km$cluster   # append cluster column to df

# sample is 10% of population (100,000 rows)
s  <- 1e5
df <- df[sample(nrow(df),s),]
plot(df[,1:4],col=df$cluster)

Running the same thing with a 5% sample (50,000 rows) gives this.

[Image: scatterplot matrix of V1 to V4, points colored by cluster]
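If it helps, here is a minimal variation on the sampling idea, assuming you want to keep the full data frame intact and check that the sample still reflects both clusters. The names full, idx, sub and cl are illustrative, and pch="." is just one way to keep a dense scatterplot matrix quick to draw:

```r
# Smaller synthetic data with two clusters, in the spirit of the example above
set.seed(1)
n    <- 5e4
full <- data.frame(V1 = c(rnorm(n, 2), rnorm(n, -2)),
                   V2 = c(rnorm(n, 2), rnorm(n, -2)),
                   V3 = c(rnorm(n, 2), rnorm(n, -2)),
                   V4 = c(rnorm(n, 2), rnorm(n, -2)))
km   <- kmeans(full, 2)

idx <- sample(nrow(full), 5000)   # row indices of a 5% sample
sub <- full[idx, ]                # full is left unchanged
cl  <- km$cluster[idx]            # cluster labels for the sampled rows

# cluster shares in the full data vs. the sample
prop.table(table(km$cluster))
prop.table(table(cl))

plot(sub, col = cl, pch = ".")    # scatterplot matrix, one dot per point
```

Sampling by index rather than overwriting df means the cluster labels and the full data stay available for a second, differently sized sample.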

Tags:

plot

r

machine-learning

k-means

rstudio

Patthebug

1 Answer

jlhoward
