 

clustering very large dataset in R

I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, the classical clustering approach would require building a 70,000 x 70,000 distance matrix holding the distance between every pair of values, which is roughly 39 GB of doubles and won't fit in memory. Is there a smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and big analytics libraries in R, but I still can't fit the data into memory.

asked Feb 24 '14 by DOSMarter

People also ask

Which kind of clustering algorithm is better for very large datasets?

CLARA (Clustering Large Applications). It is a sample-based method that randomly selects small subsets of data points rather than considering all observations, which means that it works well on a large dataset.
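In R, CLARA is available through the cluster package. A minimal sketch, where the number of clusters, samples, and sample size are placeholder values for illustration:

# Illustrative data: 70,000 one-dimensional distance values in [0, 50]
x <- matrix(runif(70000, 0, 50), ncol = 1)

# CLARA applies PAM to repeated random subsets of the data, so it never
# builds the full 70,000 x 70,000 distance matrix
library(cluster)
cc <- clara(x, k = 5, samples = 50, sampsize = 1000)
table(cc$clustering)  # cluster sizes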

Which clustering can handle big data?

According to this research, the k-means method is regarded as a viable approach for certain big-data clustering applications and has attracted more researchers than any other technique.

Is K means clustering good for large datasets?

K-means is one of the most widely used clustering methods, and K-means on top of MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle, because the number of iterations grows as the dataset size and the number of clusters increase.

Can R handle large data sets?

As a rule of thumb: data sets that contain up to one million records can easily be processed with standard R. Data sets with about one million to one billion records can also be processed in R, but need some additional effort.


2 Answers

70,000 is not large. It's not small, but it's also not particularly large... The problem is the limited scalability of matrix-oriented approaches.

But there are plenty of clustering algorithms that do not use matrices and do not need O(n^2) (or even worse, O(n^3)) runtime.

You may want to try ELKI, which has great index support (try the R*-tree with SortTileRecursive bulk loading). The index support makes it a lot faster.

If you insist on using R, at least give kmeans and the fastcluster package a try. K-means has runtime complexity O(n*k*i) (where k is the number of clusters and i is the number of iterations); fastcluster has an O(n) memory, O(n^2) runtime implementation of single-linkage clustering comparable to the SLINK algorithm in ELKI. (The R "agnes" hierarchical clustering uses O(n^3) runtime and O(n^2) memory.)
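A rough sketch of both options in R; the cluster counts and iteration limits below are placeholder values, not recommendations:

# 70,000 one-dimensional values, as in the question
x <- runif(70000, 0, 50)

# k-means: O(n*k*i) runtime, no pairwise distance matrix is ever built
km <- kmeans(x, centers = 10, iter.max = 50, nstart = 5)

# fastcluster: hclust.vector computes single linkage directly from the
# observations in O(n) memory, without materializing a dist object
library(fastcluster)
hc <- hclust.vector(as.matrix(x), method = "single")
groups <- cutree(hc, k = 10)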

Implementation matters. Often, implementations in R aren't the best, IMHO, except for core R, which usually at least has competitive numerical precision. But R was built by statisticians, not by data miners. Its focus is on statistical expressiveness, not on scalability. So the authors aren't to blame; it's just the wrong tool for large data.

Oh, and if your data is 1-dimensional, don't use clustering at all. Use kernel density estimation. 1-dimensional data is special: it's ordered. Any good algorithm for breaking 1-dimensional data into intervals should exploit the fact that you can sort the data.
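A minimal sketch of that idea in base R, cutting the data at local minima of the estimated density (the bimodal example data and the default bandwidth are assumptions for illustration):

# Illustrative 1-d data with two modes
x <- c(rnorm(35000, mean = 10, sd = 3), rnorm(35000, mean = 35, sd = 4))

d <- density(x)                                   # kernel density estimate
minima <- which(diff(sign(diff(d$y))) == 2) + 1   # indices of local minima
breaks <- d$x[minima]                             # cut points between modes
groups <- cut(x, breaks = c(-Inf, breaks, Inf))   # interval membership
table(groups)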

answered Oct 17 '22 by Has QUIT--Anony-Mousse


You can use kmeans, which is normally suitable for this amount of data, to compute a large number of centers (1,000, 2,000, ...) and then perform hierarchical clustering on the coordinates of those centers. This way the distance matrix becomes much smaller.

## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering (HCPC) on the raw data: does not necessarily work
library(FactoMineR)
cah.test <- HCPC(as.data.frame(x), graph = FALSE, nb.clust = -1)

# Hierarchical clustering on the kmeans centers: works quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(as.data.frame(cl$centers), graph = FALSE, nb.clust = -1)
plot(cah, choice = "tree")
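As a follow-up, each original point can then be mapped to a final cluster through its kmeans center; this sketch assumes that HCPC's data.clust result keeps the centers in their input row order:

# Assign each point the cluster label of its kmeans center
final.clusters <- cah$data.clust$clust[cl$cluster]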
answered Oct 17 '22 by Victorp