clustering with NA values in R

I was surprised to find out that clara from library(cluster) allows NAs. But function documentation says nothing about how it handles these values.

So my questions are:

  1. How does clara handle NAs?
  2. Can the same approach be used with kmeans, which does not allow NAs?

[Update] I found these lines of code in the clara function:

    inax <- is.na(x)
    valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))
    x[inax] <- valmisdat

which replace missing values with valmisdat. I am not sure I understand the reason for this formula. Any ideas? Would it be more "natural" to treat NAs in each column separately, perhaps replacing them with the column mean or median?

danas.zuokas asked May 23 '12 13:05


2 Answers

Although it is not stated explicitly, I believe that NAs are handled in the manner described in the ?daisy help page. The Details section says:

In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.

Given that the same code is used internally by clara(), this is how I understand NAs in the data to be handled: they simply do not take part in the computation. This is a reasonably standard way of proceeding in such cases and is used, for example, in the definition of Gower's generalised similarity coefficient.
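As a rough illustration of what "not included in the dissimilarities" can mean, here is a minimal base-R sketch of a Euclidean distance that skips every column in which either row is NA. The function name na_dist is hypothetical, and the rescaling by the share of usable columns is an assumption about how such a distance is commonly normalised, not a copy of the cluster package's code.

```r
# Hypothetical helper: Euclidean distance between two rows that ignores
# columns where either value is missing.
na_dist <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)          # columns observed in both rows
  if (!any(ok)) return(NA_real_)       # no shared columns: distance undefined
  # Assumption: rescale the squared sum so that rows with fewer usable
  # columns are comparable with complete rows.
  sqrt(sum((a[ok] - b[ok])^2) * length(a) / sum(ok))
}

a <- c(1, 2, NA, 4)
b <- c(2, NA, 3, 8)
d <- na_dist(a, b)   # only columns 1 and 4 contribute
print(d)
```

Only the first and fourth columns are used here, so the squared differences 1 and 16 are summed and scaled by 4/2 before the square root is taken.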

Update: The C sources clearly indicate that this is how NAs are handled by clara() (lines 350-356 in ./src/clara.c):

    if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
        /* in the following line (Fortran!), x[-2] ==> seg.fault
           {BDR to R-core, Sat, 3 Aug 2002} */
        if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
            continue /* next j */;
        }
    }
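
This also suggests a reading of the R-level replacement quoted in the question: valmisdat looks like a missing-value code rather than an imputed value. Because it is 1.1 times the largest absolute value in the data, it cannot coincide with any observed entry (unless every observed value is zero), so the C code can recognise it with an equality test such as x[lj] == valmd[j]. A small sketch with made-up data, assuming that interpretation:

```r
# Made-up data with two missing entries.
x <- matrix(c(1.0, NA, -3.5, 2.0, 4.0, NA), ncol = 2)

inax <- is.na(x)
valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))
print(valmisdat)   # 4.4, larger in magnitude than any observed value

# Code the missing entries with the sentinel, as the clara R code does.
x_coded <- x
x_coded[inax] <- valmisdat

# The sentinel marks exactly the originally missing cells.
print(which(x_coded == valmisdat))
```

Under this reading the coded cells are never used as data; they only tell the distance loop which columns to skip.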
Gavin Simpson answered Oct 24 '22 07:10


Not sure if kmeans can handle missing data by ignoring the missing values in a row.

There are two steps in kmeans:

  1. Calculate the distance between each observation and the current cluster means.
  2. Update the cluster means based on the newly calculated distances.

When the observations contain missing data, Step 1 can be handled by adjusting the distance metric appropriately, as clara, pam and daisy in the cluster package do. Step 2, however, can only be performed if there is a value for each column of an observation. Imputation may therefore be the next best option for handling missing data in kmeans.
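
A minimal sketch of the imputation route, using simulated data and a hypothetical helper impute_median that fills each column's NAs with that column's median. Median imputation is only one possible choice and is assumed here for illustration; it is not something kmeans does itself.

```r
set.seed(1)

# Simulated data: two groups in two dimensions, with a few values removed.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
x[sample(length(x), 5)] <- NA

# Hypothetical helper: replace NAs column by column with the column median.
impute_median <- function(m) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    m[miss, j] <- median(m[, j], na.rm = TRUE)
  }
  m
}

x_imp <- impute_median(x)

# kmeans() now runs because the matrix is complete.
fit <- kmeans(x_imp, centers = 2, nstart = 10)
print(table(fit$cluster))
```

Imputed values pull observations towards the centre of each column, so results for rows with many missing entries should be read with some caution.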

data-frame-gg answered Oct 24 '22 08:10