
Python hierarchical clustering with missing values

I am new to Python. I would like to perform hierarchical clustering on an N by P dataset that contains some missing values. I am planning to use the scipy.cluster.hierarchy.linkage function, which takes a distance matrix in condensed form. Does Python have a method to compute a distance matrix for data containing missing values? (In R, the dist function automatically takes care of missing values, but scipy.spatial.distance.pdist does not seem to handle them.)

FairyOnIce asked Oct 28 '25 16:10



1 Answer

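I could not find a built-in method to compute a distance matrix for data with missing values, so here is my naive solution. For each pair of rows it takes the mean squared difference over the features observed in both rows, i.e. a squared Euclidean distance normalized by the number of shared features.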

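import numpy as np

def getMissDist(x, y):
    # Mean squared difference over the features observed in both x and y.
    # Returns NaN (with a RuntimeWarning) if the rows share no observed feature.
    return np.nanmean((x - y) ** 2)

def getMissDistMat(dat):
    # Symmetric N x N matrix of pairwise distances between the rows of dat.
    Npat = dat.shape[0]
    dist = np.zeros((Npat, Npat))
    for ix in range(1, Npat):
        for iy in range(ix):
            # Fill the lower triangle and mirror it into the upper triangle.
            dist[ix, iy] = getMissDist(dat[ix], dat[iy])
            dist[iy, ix] = dist[ix, iy]
    return dist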
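
For larger N the double loop gets slow. A possible vectorized variant, sketched here with NumPy broadcasting, computes the same quantity in one shot. The name miss_dist_matrix and the toy array are made up for illustration, and the intermediate array needs O(N * N * P) memory, so this is only a sketch for moderately sized data:

```python
import numpy as np

def miss_dist_matrix(dat):
    """Pairwise mean squared difference over features observed in both rows."""
    dat = np.asarray(dat, dtype=float)
    # diff[i, j, k] = dat[i, k] - dat[j, k]; NaN wherever either value is missing
    diff = dat[:, None, :] - dat[None, :, :]
    # nanmean skips the NaN entries, so each pair averages over shared features only
    return np.nanmean(diff ** 2, axis=2)

# Toy data (hypothetical): 3 cases, 3 features, two missing values
dat = np.array([[1.0, 2.0, np.nan],
                [2.0, np.nan, 4.0],
                [3.0, 6.0, 7.0]])
D = miss_dist_matrix(dat)
```

Rows 0 and 1 share only the first feature, so their distance is (1 - 2)**2 = 1. Rows 0 and 2 share the first two features, giving (4 + 16) / 2 = 10.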

Assuming dat is an N (number of cases) by P (number of features) data matrix with missing values, hierarchical clustering can then be performed on dat as follows:

import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier

distMat = getMissDistMat(dat)
condensDist = dist.squareform(distMat)
link = hier.linkage(condensDist, method='average')
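
As a rough end-to-end check, here is a self-contained sketch on synthetic data. The two groups, the positions of the missing entries, and the choice of two flat clusters via fcluster are all assumptions made for illustration. It also guards against a NaN distance, which would appear if two rows shared no observed feature:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two made-up groups of 5 cases each, 4 features, centered at 0 and at 10
dat = np.vstack([rng.normal(0.0, 1.0, size=(5, 4)),
                 rng.normal(10.0, 1.0, size=(5, 4))])
# Knock out a few entries to simulate missing values
dat[0, 1] = np.nan
dat[3, 2] = np.nan
dat[7, 0] = np.nan

# Mean squared difference over the features observed in both rows
diff = dat[:, None, :] - dat[None, :, :]
dist_mat = np.nanmean(diff ** 2, axis=2)

# A pair of rows with no shared features would give NaN; stop early in that case
if not np.isfinite(dist_mat).all():
    raise ValueError("some pairs of rows share no observed feature")

condensed = squareform(dist_mat)             # square matrix -> condensed vector
link = linkage(condensed, method='average')  # average-linkage dendrogram
labels = fcluster(link, t=2, criterion='maxclust')  # cut into at most 2 clusters
```

With groups this well separated, the first five and the last five cases should end up in different clusters. On real data the right number of clusters is of course not known in advance.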
FairyOnIce answered Oct 31 '25 06:10



