I was trying to draw a hierarchical clustering of some samples (40 of them) over some features(genes) and I have a big table with 500k rows and 41 columns (1st one is name) and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around of this limitation? I googled it and came across to the ff package in R but I don't quite understand whether that could solve my issue.
Thanks!
Example 1: Compute Euclidean Distance Using Default Specifications of dist() Function. Have a look at the output of the RStudio console. It shows the distances of each combination of our data rows. Note that the dist function computes the Euclidean Distance by default.
The dist() function in R computes the distance between the given rows of an input data matrix. It returns a distance matrix computed by any of the specified methods for measuring distance.
In R, the dist() function is used to compute a distance matrix. But the result you get back isn't really a matrix, it's a "dist" object. Under the hood, the "dist" object is stored as a simple vector. When it's printed out, R knows how to make it look like a matrix.
For computing distance matrix by GPU in R programming, we can use the dist() function. dist() function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
Generally speaking hierarchical clustering is not the best approach for dealing with very large datasets.
In your case however there is a different problem. If you want to cluster samples structure of your data is wrong. Observations should be represented as the rows, and gene expression (or whatever kind of data you have) as the columns.
Lets assume you have data like this:
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
NOTE
You should remember that euclidean distances can be rather misleading when you work with high-dimensional data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With