
Python equivalent of daisy() in the cluster package of R

I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

if(!require("cluster")) { install.packages("cluster");  require("cluster") }
data(flower)
as.matrix(daisy(flower, metric = "gower"))

This uses the Gower metric to handle the nominal variables. Is there a Python equivalent of R's daisy() function?

Or is there any other module or function that supports the Gower metric, or something similar, for calculating the (dis)similarity matrix of a dataset with mixed (nominal, numeric) attributes?

Zhubarb asked Oct 15 '14 16:10




2 Answers

Simply implementing a Gower function to use with pdist won't be enough.

Internally, pdist performs several numerical transformations that fail if you pass it a matrix with mixed data.

I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module (I could not simply override the functions, because the defs in the pdist module are private).

The results I have obtained with this so far match those from R's daisy function.

The source code is available in this Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/
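
To make the idea concrete without opening the notebook, here is a minimal sketch of a Gower dissimilarity matrix in plain NumPy. It is not the notebook's code. The function name `gower_matrix` is made up for illustration, and it assumes no missing values, equal weights, and only numeric and nominal attributes (ordinal handling as in daisy is left out).

```python
import numpy as np

def gower_matrix(numeric, nominal):
    """Pairwise Gower dissimilarities for n observations (illustrative sketch).

    numeric: (n, p) array-like of numeric attributes
    nominal: (n, q) array-like of category labels
    Assumes no missing values and equal weights for every attribute.
    """
    numeric = np.asarray(numeric, dtype=float)
    nominal = np.asarray(nominal)
    n = numeric.shape[0]
    total = np.zeros((n, n))

    # Numeric attributes: absolute difference scaled by the attribute's range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # a constant column then contributes 0 everywhere
    for k in range(numeric.shape[1]):
        col = numeric[:, k]
        total += np.abs(col[:, None] - col[None, :]) / ranges[k]

    # Nominal attributes: 0 if the labels match, 1 otherwise.
    for k in range(nominal.shape[1]):
        col = nominal[:, k]
        total += (col[:, None] != col[None, :]).astype(float)

    # Average the per-attribute contributions.
    return total / (numeric.shape[1] + nominal.shape[1])

# Three observations, two numeric attributes and one nominal attribute.
D = gower_matrix([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]],
                 [["a"], ["a"], ["b"]])
# D[0, 1] is (2/4 + 10/20 + 0) / 3 = 1/3
```

Keeping the numeric and nominal columns in separate arrays avoids the mixed-dtype matrix that causes trouble inside pdist.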

Marcelo Beckmann answered Oct 18 '22 17:10


I believe you are looking for scipy.spatial.distance.pdist.

If you implement a function that computes the Gower distance on a single pair of observations, you can pass that function to pdist and it will apply it pairwise and return the resulting matrix of pairwise distances. It does not appear that the Gower distance is one of the built-in options.
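
A rough sketch of that approach follows. Since pdist works on a numeric array, the sketch assumes the nominal columns have already been converted to integer codes. The helper `make_gower` and the `is_nominal` mask are hypothetical names, not part of SciPy; only `pdist` and `squareform` are real SciPy functions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def make_gower(X, is_nominal):
    """Build a pairwise Gower function for pdist (illustrative sketch).

    X: (n, p) numeric array; nominal columns hold integer category codes.
    is_nominal: length-p boolean mask marking the nominal columns.
    Assumes no missing values and equal attribute weights.
    """
    is_nominal = np.asarray(is_nominal, dtype=bool)
    ranges = X.max(axis=0) - X.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant columns

    def gower(u, v):
        # Range-scaled absolute differences on the numeric columns.
        num = np.abs(u - v)[~is_nominal] / ranges[~is_nominal]
        # Simple mismatch (0 or 1) on the nominal columns.
        nom = (u != v)[is_nominal]
        return (num.sum() + nom.sum()) / u.size

    return gower

# Last column is a nominal attribute stored as integer codes.
X = np.array([[1.0, 10.0, 0], [3.0, 20.0, 0], [5.0, 30.0, 1]])
is_nominal = [False, False, True]

# pdist returns the condensed distances; squareform expands them to a matrix.
D = squareform(pdist(X, metric=make_gower(X, is_nominal)))
```

A Python callable is invoked once per pair, so this will be slow for large datasets compared with the built-in metrics.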

Likewise, if a single observation has mixed attributes, you can just define your own function which, say, uses something like the Euclidean distance on the subset of numerical attributes, a Gower distance on the subset of categorical attributes, and adds them -- or any other implementation of what it means to you, for your application, to compute the distance between two isolated observations.
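
As a sketch of that kind of hand-rolled pairwise function, the following adds a Euclidean distance over the numeric fields to the fraction of mismatching categorical fields (which is what Gower reduces to for purely nominal data). The name `mixed_distance` and the index arguments are invented for this example, and the unweighted sum is only one possible choice.

```python
import math

def mixed_distance(a, b, numeric_idx, categorical_idx):
    """Distance between two observations with mixed attributes (sketch).

    a, b: sequences holding one observation each
    numeric_idx: positions of the numeric fields
    categorical_idx: positions of the categorical fields
    """
    # Euclidean distance on the numeric fields (unscaled; an assumption).
    euclid = math.sqrt(sum((a[i] - b[i]) ** 2 for i in numeric_idx))
    # Fraction of categorical fields whose labels differ.
    mismatches = sum(a[i] != b[i] for i in categorical_idx)
    matching = mismatches / len(categorical_idx) if categorical_idx else 0.0
    return euclid + matching

obs1 = (3.0, 4.0, "red", "small")
obs2 = (0.0, 0.0, "red", "large")
d = mixed_distance(obs1, obs2, numeric_idx=[0, 1], categorical_idx=[2, 3])
# Euclidean part is 5.0, categorical part is 1/2, so d is 5.5
```

Because the Euclidean part is unbounded while the categorical part lies in [0, 1], the numeric fields can dominate unless they are scaled first.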

For clustering in Python, you usually want to work with scikit-learn (formerly scikits.learn). This question and answer page discusses exactly this problem of using a custom distance measure (in your case Gower) with scikit-learn, which does not appear to be possible.

You could use one of the metrics provided by pdist along with the implementation at that linked answer page, or you could implement a function for the Gower similarity and use that. But if you want the out-of-the-box clustering tools from scikit-learn, it does not appear to be directly possible.

ely answered Oct 18 '22 18:10