I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, with the classical clustering approach I would have to build a 70,000 × 70,000 distance matrix holding the distance between every pair of points, which won't fit in memory. Is there any smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and big analytics libraries in R, but I still can't fit the data into memory.
CLARA (Clustering LARge Applications) is a sample-based method: it repeatedly selects a small random subset of the data points instead of working on all observations at once, which is why it scales well to large datasets.
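In R, CLARA is available as clara() in the cluster package. A minimal sketch, where the vector d, the number of clusters k = 5, and the sampling parameters are just placeholder assumptions for the 70,000 distance values:

library(cluster)

# Stand-in for the 70,000 distance values between 0 and 50
set.seed(42)
d <- runif(70000, min = 0, max = 50)

# clara() clusters on repeated random sub-samples, so it never builds
# the full 70,000 x 70,000 distance matrix
res <- clara(matrix(d, ncol = 1), k = 5, samples = 50, sampsize = 1000)
table(res$clustering)   # cluster sizes
res$medoids             # representative value of each cluster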
According to this research, the k-means method is regarded as a viable approach for certain big-data clustering applications and has attracted more researchers than any other technique.
K-Means is one of the most widely used clustering methods, and K-Means on top of MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle, because the number of iterations grows as the dataset size and the number of clusters increase.
As a rule of thumb: datasets with up to one million records can easily be processed with standard R. Datasets with roughly one million to one billion records can also be processed in R, but need some additional effort.
70,000 is not large. It's not small, but it's also not particularly large. The problem is the limited scalability of matrix-oriented approaches. There are plenty of clustering algorithms that do not use a distance matrix and do not need O(n^2) (or even worse, O(n^3)) runtime.
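To make the memory problem concrete, here is a quick back-of-the-envelope check in R of what a dense distance matrix for 70,000 points would cost:

n <- 70000
n^2 * 8 / 1024^3                  # ~36.5 GiB for the full matrix of doubles
(n * (n - 1) / 2) * 8 / 1024^3    # ~18.2 GiB even for just the lower triangle (as 'dist' stores it)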
You may want to try ELKI, which has great index support (try the R*-tree with Sort-Tile-Recursive (STR) bulk loading). The index support makes it a lot faster.
If you insist on using R, at least give kmeans a try, along with the fastcluster package. K-means has runtime complexity O(n*k*i) (where k is the number of clusters and i is the number of iterations); fastcluster has an O(n) memory, O(n^2) runtime implementation of single-linkage clustering comparable to the SLINK algorithm in ELKI. (The R "agnes" hierarchical clustering uses O(n^3) runtime and O(n^2) memory.)
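A minimal sketch of both options in R, assuming the 70,000 values are in a plain numeric vector (kmeans() is base R; hclust.vector() from the fastcluster package works directly on the observations for single-linkage/Ward clustering instead of requiring a precomputed dist object; the choice of 10 clusters is just an assumption):

library(fastcluster)   # provides a fast hclust() and hclust.vector()

set.seed(1)
d <- runif(70000, 0, 50)   # stand-in for the 70,000 distance values

# Option 1: k-means, O(n*k*i) runtime, no distance matrix at all
km <- kmeans(d, centers = 10, iter.max = 50, nstart = 5)

# Option 2: memory-saving single-linkage clustering via hclust.vector()
hc <- hclust.vector(matrix(d, ncol = 1), method = "single")
clusters <- cutree(hc, k = 10)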
Implementation matters. Often, implementations in R aren't the best IMHO, except for core R, which usually at least has competitive numerical precision. But R was built by statisticians, not by data miners. Its focus is on statistical expressiveness, not on scalability. So the authors aren't to blame; it's just the wrong tool for large data.
Oh, and if your data is one-dimensional, don't use clustering at all. Use kernel density estimation. One-dimensional data is special: it's ordered. Any good algorithm for breaking one-dimensional data into intervals should exploit the fact that you can sort the data.
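A minimal sketch of this idea in base R, assuming the values sit in a numeric vector d: estimate the density with density(), take the valleys (local minima) of the estimate as cut points, and assign each value to the interval it falls into.

# Stand-in data: two overlapping groups of distances
set.seed(1)
d <- c(rnorm(35000, mean = 10, sd = 2), rnorm(35000, mean = 35, sd = 3))

dens <- density(d, n = 2048)              # kernel density estimate on a fine grid
y <- dens$y
valleys <- which(diff(sign(diff(y))) == 2) + 1   # indices of local minima of the curve
cuts <- dens$x[valleys]                   # valley positions become interval boundaries

groups <- cut(d, breaks = c(-Inf, cuts, Inf))
table(groups)                             # size of each interval / "cluster"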
You can use kmeans, which is usually suitable for this amount of data, to compute a large number of centers (1000, 2000, ...) and then perform hierarchical clustering on the coordinates of those centers. This way the distance matrix becomes much smaller.
## Example
# Simulated data: two Gaussian clouds, 70,000 points in total
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

library(FactoMineR)

# Hierarchical clustering (CAH) directly on the raw data:
# does not necessarily work, the distance matrix is too large
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)

# CAH on the k-means centers: works quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(cah, choice = "tree")