clustering with NA values in R

Question

I was surprised to find that clara() from the cluster package accepts NAs, but the function's documentation says nothing about how it handles them.

So my questions are:

How does clara handle NAs?
Can the same approach somehow be used for kmeans (which does not allow NAs)?

[Update] I found these lines in the source of the clara function:

    inax <- is.na(x)
    valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))
    x[inax] <- valmisdat

These replace every missing value with valmisdat, a single number computed over the whole matrix. I am not sure I understand the reason for this formula. Any ideas? Would it be more "natural" to treat NAs in each column separately, perhaps replacing them with the column mean or median?
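To make the comparison concrete, here is a minimal sketch that runs the quoted lines on a toy matrix and sets them next to per-column median imputation. The matrix and the names x_flag and x_med are made up for illustration. The replacement value is larger in magnitude than anything observed, so it cannot collide with a real data value, which would be consistent with it acting as a marker rather than as an estimate.

```r
# Toy data with missing values (hypothetical, for illustration only)
x <- matrix(c( 1, NA, 3,
              -5,  2, NA,
               4,  0, 2.5),
            nrow = 3, byrow = TRUE)

# The lines quoted from clara(): one value for the whole matrix
inax <- is.na(x)
valmisdat <- 1.1 * max(abs(range(x, na.rm = TRUE)))

x_flag <- x
x_flag[inax] <- valmisdat

# Alternative raised in the question: impute each column by its own median
x_med <- x
for (j in seq_len(ncol(x_med))) {
  miss <- is.na(x_med[, j])
  x_med[miss, j] <- median(x_med[, j], na.rm = TRUE)
}

valmisdat  # lies outside the observed range of x
x_flag
x_med
```

With the median version the filled-in cells look like ordinary observations, whereas the single out-of-range value stays recognisable afterwards.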

Gavin Simpson · Accepted Answer

Although it is not stated explicitly, I believe that NAs are handled in the manner described in the ?daisy help page. The Details section says:

In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.

Assuming clara() uses the same code internally, this is how I understand NAs in the data to be handled: they simply do not take part in the computation. This is a reasonably standard way of proceeding in such cases and is used, for example, in the definition of Gower's generalised similarity coefficient.

Update: The C sources clearly indicate that this (the above) is how NAs are handled by clara() (lines 350-356 in ./src/clara.c):

    if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
        /* in the following line (Fortran!), x[-2] ==> seg.fault
           {BDR to R-core, Sat, 3 Aug 2002} */
        if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
            continue /* next j */;
        }
    }
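The same skip-if-missing idea can be tried from R without reading the C code. Below is a rough sketch, not clara's actual implementation: na_skip_dist is a hypothetical helper, and the two vectors are made up. It drops any coordinate that is missing in either observation and rescales by the share of coordinates used, which is the convention documented for base R's dist(), so the two results can be compared.

```r
# Two hypothetical observations; the third coordinate of `a` is missing
a <- c(1, 2, NA)
b <- c(4, 6, 3)

# Hand-rolled Euclidean distance that skips coordinates missing in either vector
na_skip_dist <- function(u, v) {
  ok <- !(is.na(u) | is.na(v))      # coordinates present in both
  if (!any(ok)) return(NA_real_)    # nothing left to compare
  # scale the sum up by (total columns / columns used)
  sqrt(sum((u[ok] - v[ok])^2) * length(u) / sum(ok))
}

d_manual <- na_skip_dist(a, b)
d_base   <- as.numeric(dist(rbind(a, b)))  # stats::dist also excludes NA columns

c(manual = d_manual, base = d_base)
```

Only the first two coordinates contribute here; the missing one is neither imputed nor counted as a difference.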

data-frame-gg · Answer

I am not sure whether kmeans can handle missing data by ignoring the missing values in a row.

There are two steps in kmeans:

1. Calculating the distance between each observation and the current cluster means.
2. Updating the cluster means based on the newly calculated distances.

When we have missing data in our observations, Step 1 can be handled by adjusting the distance metric appropriately, as clara/pam/daisy in the cluster package do. But Step 2 can only be performed if we have some value for each column of an observation. Imputation might therefore be the next best option for dealing with missing data in kmeans.
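As a minimal sketch of the imputation route, assuming per-column median imputation is acceptable for the data at hand: the simulated data and the helper impute_median are hypothetical, and the median is only one of several possible fill-in choices.

```r
set.seed(1)  # kmeans uses random starting centres

# Two simulated groups in two dimensions, with a few values knocked out
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
x[c(3, 27), 1] <- NA
x[15, 2] <- NA

# Replace each missing value by the median of its own column
impute_median <- function(m) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    m[miss, j] <- median(m[, j], na.rm = TRUE)
  }
  m
}

x_imp <- impute_median(x)

# kmeans() stops with an error on NAs, so it is run on the imputed copy
fit <- kmeans(x_imp, centers = 2, nstart = 10)
table(fit$cluster)
```

Imputed rows are pulled toward the column median, so it may be worth checking how many values per row were filled in before trusting their cluster assignment.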

Tags: r, cluster-analysis

danas.zuokas
