
Any workaround for clustering mixed data types and rendering a 3D scatter plot in R?

I am trying to see how data points are distributed within labeled groups in a 3D plot, so that I can judge how similar the groups are to each other in 3D space. To do so, I used the scatterplot3d package from CRAN to draw a 3D scatter plot, but I didn't get the correct plot for my data.

reproducible data

Here is the reproducible data that I used.

    > dput(head(phenDat,30))
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", 
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01", 
"Tarca_051_P1E03", "Tarca_063_P1F03", "Tarca_075_P1G03", "Tarca_087_P1H03", 
"Tarca_004_P1A04", "Tarca_064_P1F04", "Tarca_076_P1G04", "Tarca_088_P1H04", 
"Tarca_005_P1A05", "Tarca_017_P1B05", "Tarca_054_P1E06", "Tarca_066_P1F06", 
"Tarca_078_P1G06", "Tarca_090_P1H06", "Tarca_007_P1A07", "Tarca_019_P1B07", 
"Tarca_031_P1C07", "Tarca_079_P1G07", "Tarca_091_P1H07", "Tarca_008_P1A08", 
"Tarca_020_P1B08", "Tarca_022_P1B10", "Tarca_034_P1C10", "Tarca_046_P1D10"
), GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 19.7, 23.6, 27.6, 
30.6, 32.6, 12.6, 18.6, 25.6, 30.6, 36.4, 24.9, 28.9, 36.6, 19.9, 
26.1, 30.1, 36.7, 13.6, 17.6, 22.6, 24.7, 13.3, 19.7, 24.7), 
    Batch = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 
    6L, 6L, 6L), Set = c("PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", "PRB_HTA", 
    "PRB_HTA", "PRB_HTA"), Train = c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Platform = c("HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", 
    "HTA20", "HTA20", "HTA20", "HTA20", "HTA20", "GSE113966", "GSE113966", 
    "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", "GSE113966", 
    "GSE113966")), row.names = c(NA, 30L), class = "data.frame")

my attempt:

hclustfunc <- function(x) hclust(x, method="complete")
distfunc <- function(x) as.dist((1-cor(t(x)))/2)
d <- distfunc(persons_df)
fit <- hclustfunc(d)

my updated attempt:

library(rgl)
library(car)
scatter3d(x = PC1, y = PC2, z = PC3, surface = FALSE, groups = as.factor(clusters),  surface.col = cluster.colors, col = cluster.colors, xlab="PC1",ylab="PC2",zlab="PC3")

Basically, I want to see the data points (i.e., rows) that belong to different batches (or groups) and color them by some 'group' attribute. I just want to see how similar the data points are to each other when they are grouped by age category, batch, and platform.

I am thinking of using k-means, PCA, or other methods that can give me components to visualize in a 3D plot, but it is not very intuitive to me how to do this in R.

desired plot:

I want to get 3D plot something like this:

[image: desired 3D scatter plot]

Can anyone point out how I can make this happen? Is there any way to cluster my data and visualize it in a 3D plot in R? Any thoughts? Thanks.

update: simplest things might be possible:

I don't want an overly complicated solution to begin with. I just want to group the data points (i.e., the rows) by batch, platform, and age category (I used findInterval(persons_df$ages, c(10,20,30,40,50))). Is there any way to make this happen in R?

Jerry07 asked Jul 11 '19 21:07



1 Answer

Edit - added k-modes approach for mixed data clustering.

You might also consider plotly for 3D plotting. Here's an example with your data, where I've defined a group for every existing combination of Batch, Platform, and 10-year age bucket. In plotly each group is assigned its own color, and you can click a group in the legend to toggle it or double-click to isolate it. You'd need to modify this for much bigger data; for instance, you could remove Platform from the grouping, since it's already mapped to z.

library(plotly)
library(dplyr)
library(RColorBrewer)

age_group <- 10  # width of each age bucket
phenDat %>%
  # one group per existing combination of Batch, Platform, and age bucket
  mutate(group = paste(Batch, Platform, "age",
                       floor(GA/age_group)*age_group, "-",
                       floor(GA/age_group)*age_group + age_group - 1)) %>%
  # `colors` is interpolated across the groups, so each group keeps one color
  plot_ly(x = ~GA, y = ~Set, z = ~Platform, color = ~group,
          colors = brewer.pal(11, "Spectral")) %>%
  add_markers(marker = list(size = 2)) %>%
  layout(scene = list(xaxis = list(title = "GA"),
                      yaxis = list(title = "Set"),
                      zaxis = list(title = "Platform")))

[image: plotly 3D scatter plot colored by group]
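To see which labels the grouping step produces, the bucket arithmetic can be sketched on its own in base R. This is only an illustration: `ga` is a hypothetical stand-in holding the first five GA values from the question, and `bucket_label` is a name introduced here, not part of the code above.

```r
# Hypothetical stand-in: the first five GA values from the question's data
ga <- c(11, 15.3, 21.7, 26.7, 31.3)
age_group <- 10

# Lower edge of the bucket each value falls into (10, 20, 30, ...)
lower <- floor(ga / age_group) * age_group

# Same label format as the paste() call used for the plot groups
bucket_label <- paste("age", lower, "-", lower + age_group - 1)
print(bucket_label)
```

Changing `age_group` to 5, for example, would give narrower buckets and therefore more groups in the legend.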

As for clustering the mixed data, here's an approach using the kmodes function from the klaR package, which seems to give plausible results here:

set.seed(1)  # kmodes picks its starting modes at random; fix the seed for repeatable clusters
phenDat %>%
  # ask for 6 clusters and keep the labels as character so plotly treats them as discrete
  bind_cols(cluster = klaR::kmodes(phenDat, 6)[["cluster"]] %>% as.character) %>%
  plot_ly(x = ~GA, y = ~Set, z = ~Platform, color = ~cluster) %>%
  add_markers(marker = list(size = 5)) %>%
  layout(scene = list(xaxis = list(title = "GA"),
                      yaxis = list(title = "Set"),
                      zaxis = list(title = "Platform")))

[image: plotly 3D scatter plot colored by k-modes cluster]
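As a possible alternative that keeps GA numeric rather than treating every value as a category, here is a minimal sketch using Gower dissimilarity from the cluster package followed by hierarchical clustering. This is a different technique from k-modes, not something tested on the full data. The data frame `toy` is a hypothetical stand-in for phenDat with made-up rows, and `k = 3` is an arbitrary choice.

```r
library(cluster)  # provides daisy(); a recommended package, usually installed with R

# Hypothetical stand-in for phenDat: a few made-up rows, varying columns only
toy <- data.frame(
  GA       = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 12.6, 18.6),
  Batch    = factor(c(1, 1, 1, 1, 1, 1, 2, 2)),
  Platform = factor(c("HTA20", "HTA20", "HTA20", "HTA20",
                      "GSE113966", "GSE113966", "GSE113966", "GSE113966"))
)

# Gower dissimilarity: range-scaled differences for numeric columns,
# simple mismatch for factors; daisy() needs factors, not character columns
d <- daisy(toy, metric = "gower")

# Hierarchical clustering on that dissimilarity, then cut the tree into k groups
fit <- hclust(d, method = "average")
toy$cluster <- as.character(cutree(fit, k = 3))  # k = 3 is an arbitrary choice
print(table(toy$cluster))
```

The resulting `cluster` column could then be mapped to `color = ~cluster` in `plot_ly()` in the same way as the k-modes labels.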

Jon Spring answered Oct 16 '22 07:10