
Kmeans on a million observations in R - trouble plotting clusters

I am trying to perform k-means clustering on over a million rows with 4 variables, all numeric. I am using the following code:

kmeansdf<-as.data.frame(rbind(train$V3,train$V5,train$V8,train$length))
km<-kmeans(kmeansdf,2)

As can be seen, I would like to divide my data into two clusters. The object km is populated, but I am having trouble plotting the results. Here is the code I am using to plot:

plot(kmeansdf,col=km$cluster)

This piece of code gives me the following error:

Error in plot.new() : figure margins too large

I tried researching online but could not find a solution. I also tried running it from the command line, but I still get the same error (I am using RStudio at the moment).

Any help to resolve the error would be highly appreciated. TIA.

Patthebug asked Nov 02 '22 09:11


1 Answer

When I run your code on a df with 1e6 rows, I don't get the same error, but the system hangs (interrupted after 10 min). It may be that creating a scatterplot matrix with 1e6 points per frame is just too much.
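One thing that might explain the difference in behaviour: `rbind` stacks its arguments as rows, so the question's `kmeansdf` would have 4 rows and about a million columns, and `plot()` on a data frame that wide tries to draw a scatterplot matrix with one panel per pair of columns. The sketch below only checks the shapes; the `train` data frame here is a small hypothetical stand-in for the asker's data, and whether `cbind` is what was intended is an assumption.

```r
# Hypothetical stand-in for the asker's `train` columns (small n so it runs fast)
set.seed(1)
n <- 1000
train <- data.frame(V3 = rnorm(n), V5 = rnorm(n), V8 = rnorm(n), length = rnorm(n))

# As in the question: each vector becomes one ROW
by_row <- as.data.frame(rbind(train$V3, train$V5, train$V8, train$length))

# Assumed intent: each vector becomes one COLUMN
by_col <- as.data.frame(cbind(V3 = train$V3, V5 = train$V5,
                              V8 = train$V8, length = train$length))

dim(by_row)  # 4 rows, n columns
dim(by_col)  # n rows, 4 columns
```

If the shapes come out like this on the real data, building the data frame column-wise is probably the first thing to try before sampling.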

You might consider taking a random sample:

# all this to create a df with two distinct clusters
set.seed(1)
center.1 <- c(2,2,2,2)
center.2 <- c(-2,-2,-2,-2)
n <- 5e5
f <- function(x){return(data.frame(V1=rnorm(n,mean=x[1]),
                                   V2=rnorm(n,mean=x[2]),
                                   V3=rnorm(n,mean=x[3]),
                                   V4=rnorm(n,mean=x[4])))}
df <- do.call("rbind",lapply(list(center.1,center.2),f))

km <- kmeans(df,2)         # run kmeans on full dataset
df$cluster <- km$cluster   # append cluster column to df

# sample is 10% of population (100,000 rows)
s  <- 1e5
df <- df[sample(nrow(df),s),]
plot(df[,1:4],col=df$cluster)

Running the same thing with a 1% sample (50,000 rows) gives this.

[Image: scatterplot matrix of the sampled data, colored by cluster]
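A variation on the sampling idea, as a minimal sketch: instead of overwriting `df` with the sample, draw row indices once and use them for both the data and the cluster labels, so the full data set stays available. The helper name `make_cluster`, the smaller `n`, and the sample size of 1000 are assumptions chosen only to keep the example quick.

```r
set.seed(1)
n <- 5000  # rows per cluster; far smaller than the 5e5 used above

# Hypothetical helper: n draws around each coordinate of `center`
make_cluster <- function(center) {
  as.data.frame(sapply(center, function(m) rnorm(n, mean = m)))
}
df <- rbind(make_cluster(c(2, 2, 2, 2)), make_cluster(c(-2, -2, -2, -2)))

km <- kmeans(df, 2)             # cluster the full data set

idx <- sample(nrow(df), 1000)   # row indices of the plotting sample
# Plot only the sampled rows, colored by the labels of those same rows
plot(df[idx, ], col = km$cluster[idx], pch = ".")
```

Using `pch = "."` draws each point as a single pixel, which tends to keep dense scatterplot matrices readable and faster to render.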

jlhoward answered Nov 04 '22 04:11