I am trying to see how data points are distributed within labeled groups in a 3D plot, and how similar the groups are to each other in 3D space. To do so, I used the scatterplot3d package from CRAN to draw a 3D scatter plot, but I didn't get the correct plot for my data.
reproducible data
Here is the reproducible data that I used.
> dput(head(phenDat,30))
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01",
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01",
"Tarca_051_P1E03", "Tarca_063_P1F03", "Tarca_075_P1G03", "Tarca_087_P1H03",
"Tarca_004_P1A04", "Tarca_064_P1F04", "Tarca_076_P1G04", "Tarca_088_P1H04",
"Tarca_005_P1A05", "Tarca_017_P1B05", "Tarca_054_P1E06", "Tarca_066_P1F06",
"Tarca_078_P1G06", "Tarca_090_P1H06", "Tarca_007_P1A07", "Tarca_019_P1B07",
"Tarca_031_P1C07", "Tarca_079_P1G07", "Tarca_091_P1H07", "Tarca_008_P1A08",
"Tarca_020_P1B08", "Tarca_022_P1B10", "Tarca_034_P1C10", "Tarca_046_P1D10"
), GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 19.7, 23.6, 27.6,
30.6, 32.6, 12.6, 18.6, 25.6, 30.6, 36.4, 24.9, 28.9, 36.6, 19.9,
26.1, 30.1, 36.7, 13.6, 17.6, 22.6, 24.7, 13.3, 19.7, 24.7),
Batch = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 6L,
6L, 6L, 6L), Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA",
"PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA",
"PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA",
"PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA",
"PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA",
"PRB_HTA", "PRB_HTA"), Train = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Platform = c("HTA20",
"HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20",
"HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20",
"HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "GSE113966", "GSE113966",
"GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966",
"GSE113966")), row.names = c(NA, 30L), class = "data.frame")
my attempt:
hclustfunc <- function(x) hclust(x, method="complete")
distfunc <- function(x) as.dist((1-cor(t(x)))/2)
d <- distfunc(persons_df)
fit <- hclustfunc(d)
my updated attempt:
library(rgl)
library(car)
scatter3d(x = PC1, y = PC2, z = PC3, surface = FALSE,
          groups = as.factor(clusters),
          surface.col = cluster.colors, col = cluster.colors,
          xlab = "PC1", ylab = "PC2", zlab = "PC3")
Basically, I want to color the data points (i.e., rows) by some 'group' attribute, such as batch. I just want to see how similar the data points are to each other when they are grouped by age category, batch, and platform.
I am thinking of using k-means, PCA, or another method that gives me components I can visualize in a 3D plot, but it is not clear to me how to do this in R.
desired plot:
I want to get a 3D plot something like this:
Can anyone point out how I can make this happen? Is there any way to cluster my data and visualize it in a 3D plot in R? Any thoughts? Thanks.
update: simplest things might be possible:
I don't want an overly complicated solution to begin with. I just want to group data points (i.e., rows) that belong to different batches, platforms, and age categories (I used findInterval(persons_df$ages, c(10,20,30,40,50))). Is there any way to make this happen in R?
Edit - added k-modes approach for mixed data clustering.
You might also consider plotly for 3D plotting. Here's an example with your data, where I've defined groups for every existing combination of Batch, Platform, and 10-year age bucket. In plotly these are assigned different colors, and you can double-click the group legends to toggle appearance. You'd need to modify this for much bigger data; for instance, you could remove Platform from the grouping since it is already mapped to z.
library(plotly); library(dplyr); library(RColorBrewer)
# Width of each age bucket, in the same units as GA
age_group = 10
phenDat %>%
  # One group label per combination of Batch, Platform, and age bucket
  mutate(group = paste(Batch, Platform, "age",
                       floor(GA/age_group)*age_group, "-",
                       floor(GA/age_group)*age_group + age_group - 1)) %>%
  # The palette is passed via `colors` so it is interpolated across the groups
  plot_ly(x = ~GA, y = ~Set, z = ~Platform, color = ~group,
          colors = colorRampPalette(brewer.pal(11,"Spectral"))(10)) %>%
  add_markers(marker = list(size = 2)) %>%
  layout(scene = list(xaxis = list(title = "GA"),
                      yaxis = list(title = "Set"),
                      zaxis = list(title = "Platform")))
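To make the age-bucket labels built inside mutate() easier to follow, here is a minimal base R sketch that computes them on their own. The six GA values are a small subset copied from the data above, and the break points passed to findInterval() are an assumption chosen to match the 10-unit buckets.

```r
# A few GA values taken from the question's data
GA <- c(11, 15.3, 21.7, 26.7, 31.3, 36.4)
age_group <- 10

# Lower edge of each bucket, computed the same way as in mutate()
lower <- floor(GA / age_group) * age_group

# Human-readable label such as "10 - 19"
label <- paste(lower, "-", lower + age_group - 1)

# findInterval() returns the bucket index for the same break points
# (the break points here are an assumption matching age_group = 10)
bucket <- findInterval(GA, c(10, 20, 30, 40))

print(data.frame(GA, lower, label, bucket))
```

Either the label or the bucket index could serve as the grouping variable; the label is just easier to read in a plot legend.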
As for clustering given the mixed data, here's an approach using the klaR package's kmodes function, which seems to create plausible results here:
phenDat %>%
  bind_cols(cluster = klaR::kmodes(phenDat, 6)[["cluster"]] %>% as.character) %>%
  plot_ly(x = ~GA, y = ~Set, z = ~Platform, color = ~cluster) %>%
  add_markers(marker = list(size = 5)) %>%
  layout(scene = list(xaxis = list(title = "GA"),
                      yaxis = list(title = "Set"),
                      zaxis = list(title = "Platform")))
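Since the question also asks about hierarchical clustering and about getting three derived components to plot, here is a rough sketch of one alternative to k-modes (not the approach used above): compute a Gower dissimilarity with cluster::daisy(), cluster it with hclust(), and use classical multidimensional scaling (cmdscale()) in place of PCA to get three coordinates. The toy data frame is hypothetical and only stands in for phenDat, and k = 3 clusters is an arbitrary choice.

```r
library(cluster)  # daisy() computes Gower dissimilarity for mixed data

# Hypothetical toy data standing in for phenDat
set.seed(1)
toy <- data.frame(
  GA       = round(runif(30, 10, 40), 1),
  Batch    = factor(sample(1:6, 30, replace = TRUE)),
  Platform = factor(sample(c("HTA20", "GSE113966"), 30, replace = TRUE))
)

# Gower dissimilarity handles numeric and factor columns together
d <- daisy(toy, metric = "gower")

# Complete-linkage hierarchical clustering, cut into 3 groups
# (k = 3 is an arbitrary choice for illustration)
fit <- hclust(d, method = "complete")
clusters <- cutree(fit, k = 3)

# Classical MDS gives 3 continuous coordinates per row
coords <- cmdscale(d, k = 3)

print(head(cbind(coords, cluster = clusters)))
```

The three columns of coords could then be passed to plot_ly() as x, y, and z, with color = factor(clusters), in the same way as the plots above.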