
parallel k-means in R

I am trying to understand how to parallelize some of my code in R. In the following example I want to use k-means to cluster data using 2, 3, 4, 5, and 6 centers, with 20 iterations each. Here is the code:

library(parallel)
library(BLR)

data(wheat)

parallel.function <- function(i) {
    kmeans( X[1:100,100], centers=?? , nstart=i )
}

out <- mclapply( c(5, 5, 5, 5), FUN=parallel.function )

How can we parallelize over the iterations and the centers at the same time? And how can I keep track of the outputs, assuming I want to keep all the k-means results across all iterations and centers, just to learn how it is done?

hema asked Dec 06 '13 05:12




1 Answer

This looked very simple to me at first ... and then I tried it. After a lot of monkey typing and face palming during my lunch break, however, I arrived at this:

library(parallel)
library(BLR)

data(wheat)

# each value of 2:6 is passed as `centers`; the data X goes to every call via x = X
mc = mclapply(2:6, function(x, centers) kmeans(x, centers), x = X)

It looks right, though I didn't check how sensible the clustering was.

> summary(mc)
     Length Class  Mode
[1,] 9      kmeans list
[2,] 9      kmeans list
[3,] 9      kmeans list
[4,] 9      kmeans list
[5,] 9      kmeans list
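To keep track of which fit belongs to which number of centers, one option is to name the elements of the returned list. The following is only a minimal sketch: `sim` is a simulated matrix standing in for the wheat matrix X so that it runs without BLR, and `n_cores`, `ks`, and `fits` are illustrative names, not part of the original code.

```r
library(parallel)

set.seed(1)
# Simulated stand-in for X (assumption: any numeric matrix behaves the same way here)
sim <- matrix(rnorm(100 * 5), nrow = 100)

# mclapply() relies on forking, which is not available on Windows,
# so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

ks <- 2:6
fits <- mclapply(ks, function(k) kmeans(sim, centers = k), mc.cores = n_cores)

# Label each result with the number of centers it was fitted with
names(fits) <- paste0("k", ks)

# Pull one summary number out of every fit
sapply(fits, function(f) f$tot.withinss)
```

After this, `fits$k3` is the full kmeans object for three centers, so nothing is thrown away.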

On reflection the command syntax seems sensible, although a lot of other stuff that failed seemed reasonable too... The examples in the help documentation are maybe not that great.

Hope it helps.

EDIT: As requested, here is the same thing over two variables, nstart and centers:

(pars = expand.grid(i=1:3, cent=2:4))

  i cent
1 1    2
2 2    2
3 3    2
4 1    3
5 2    3
6 3    3
7 1    4
8 2    4
9 3    4

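L=list()
# zikes horrible: appending an empty list turns each row of pars into a list with $i and $cent
pars2=apply(pars,1,append, L)
# each element of pars2 is passed as `pars`; the data X goes to every call via x=X
mc = mclapply(pars2, function(x,pars)kmeans(x, centers=pars$cent,nstart=pars$i ), x=X)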

> summary(mc)
      Length Class  Mode
 [1,] 9      kmeans list
 [2,] 9      kmeans list
 [3,] 9      kmeans list
 [4,] 9      kmeans list
 [5,] 9      kmeans list
 [6,] 9      kmeans list
 [7,] 9      kmeans list
 [8,] 9      kmeans list
 [9,] 9      kmeans list
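A possibly less horrible variant is to iterate over the row numbers of the grid and look the parameters up inside the function, which also makes it easy to store a summary of each fit next to its settings. Again only a sketch under assumptions: `sim` is simulated data standing in for X, and `n_cores`, `fits`, and the added `tot.withinss` column are illustrative names.

```r
library(parallel)

set.seed(1)
# Simulated stand-in for X (assumption, so the example runs without BLR)
sim <- matrix(rnorm(100 * 5), nrow = 100)

# Forking is not available on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

pars <- expand.grid(i = 1:3, cent = 2:4)

# One task per row of the grid; r is the row number
fits <- mclapply(seq_len(nrow(pars)), function(r) {
  kmeans(sim, centers = pars$cent[r], nstart = pars$i[r])
}, mc.cores = n_cores)

# mclapply() returns results in input order, so row r of pars describes fits[[r]]
pars$tot.withinss <- sapply(fits, function(f) f$tot.withinss)
pars
```

The full kmeans objects stay in `fits`, while `pars` becomes a small table of settings and results.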

How'd you like them apples?

Stephen Henderson answered Nov 06 '22 21:11
