dist() function in R: vector size limitation

Tags:

r

cluster-analysis

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the first one is the name), and when I tried

d <- dist(as.matrix(file), method = "euclidean")

I got this error

Error: cannot allocate vector of size 1101.1 Gb

How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether it could solve my issue.

Thanks!

asked Oct 17 '13 20:10

olala

1 Answer

Generally speaking, hierarchical clustering is not the best approach for dealing with very large datasets.

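In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong: dist() computes distances between the rows of its input, so with genes as rows you are asking for distances between 500k genes rather than between 40 samples. Observations should be represented as the rows, and gene expression (or whatever kind of data you have) as the columns.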
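To get a feel for the numbers: the object returned by dist() holds one value per pair of rows, that is n * (n - 1) / 2 doubles of 8 bytes each. A rough back-of-the-envelope sketch (the helper name dist_bytes is made up for illustration and ignores the small overhead of attributes):

```r
# Hypothetical helper: approximate size in bytes of the result of dist()
# for a matrix with n rows, counting only the n * (n - 1) / 2 stored doubles.
dist_bytes <- function(n) n * (n - 1) / 2 * 8

dist_bytes(500000) / 2^30  # roughly 931 GiB when the genes are the rows
dist_bytes(40)             # 6240 bytes when the 40 samples are the rows
```

The 1101.1 Gb in the error message is somewhat above this estimate, which would be consistent with the table having a bit more than 500k rows.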

Let's assume you have data like this:

data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))

What you want to do is:

 # Create transposed data matrix
 data.matrix.t <- t(as.matrix(data))

 # Create distance matrix
 dists <- dist(data.matrix.t)

 # Clustering
 hcl <- hclust(dists)

 # Plot
 plot(hcl)

NOTE

You should remember that Euclidean distances can be rather misleading when you work with high-dimensional data.
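One alternative that is often used for expression-like data is a correlation-based distance. This is only a sketch of that idea, not something the answer above prescribes; the matrix expr is a small invented stand-in and the choice of average linkage is an assumption. Note that cor() works on columns, so no transpose is needed when samples are the columns:

```r
# Small invented stand-in: 1000 features (rows) by 40 samples (columns)
set.seed(1)
expr <- matrix(rnorm(1000 * 40), ncol = 40)
colnames(expr) <- paste0("sample", 1:40)

# cor() correlates columns, so this is a 40 x 40 sample-by-sample matrix;
# 1 - correlation turns similarity into a dissimilarity
cor.dists <- as.dist(1 - cor(expr))

# Cluster and plot as before (average linkage chosen here for illustration)
hcl.cor <- hclust(cor.dists, method = "average")
plot(hcl.cor)
```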

answered Oct 11 '22 11:10

zero323
