Unsupervised Random Forest Proximities in Python

Question

I am currently revisiting a random forest project I did a few years back in R. The workflow was to:

1. Generate a proximity matrix of the data inputs using unsupervised RandomForest.
2. Calculate the distance matrix from this proximity matrix and pass it to the Partitioning Around Medoids (PAM) clustering algorithm.
3. Using the clusters obtained through PAM, run RandomForest in supervised mode to train a new model.
4. Use this model to predict on another dataset from a future point in time.

I have shifted my workflow to Python for many of my projects, as the language is very flexible and fun, but I am still getting my bearings in sklearn compared to how I performed such tasks in R. My hangup is in producing a proximity matrix (or some container holding the proximity between samples) to pass to PAM. I have found a post that describes a similar issue, but I have been unable to find a way to implement what the accepted answer's author suggests.

Any clues as to how to implement this? Any help would be greatly appreciated, and I will be sure to return the favor to the larger community. I know there are lots of other R-to-Python converts out there who would benefit from this sort of information.

Thanks in advance and apologies if this is a simple solution that I am simply overlooking.

Soroosh · Accepted Answer

You can use the bigrf package, written in R ( https://cran.r-project.org/web/packages/bigrf/bigrf.pdf ). It has everything you need.

Here is how you can implement it in R:

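# load the bigrf library
library('bigrf')

# x is assumed to be your input data (the features to cluster); it is not defined here
# generate a synthetic second class so the forest can be trained without labels
synthetic.df <- generateSyntheticClass(x)

# fit a classification forest on the combined real and synthetic data
forest <- bigrfc(synthetic.df$x, synthetic.df$y, trace = 1)

# compute proximities and keep only the first nrow(x) rows and columns
prox <- proximities(forest, trace = 2)
prox <- data.frame(as.matrix(prox))
prox <- prox[1:nrow(x), 1:nrow(x)]

# convert proximities to distances
dist <- sqrt(1 - prox)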
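
For the Python side of the question, a rough sketch of the same idea with scikit-learn might look like the following. It is not a port of bigrf. It assumes the synthetic class is built by permuting each column independently, and it defines proximity as the fraction of trees in which two rows fall in the same leaf (read from `RandomForestClassifier.apply`). The helper names `make_synthetic_class` and `rf_proximity` and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_synthetic_class(X, rng):
    """Stack the real rows on top of a synthetic copy whose columns are
    permuted independently (assumption: this is how the second class is built)."""
    X_synth = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )
    X_all = np.vstack([X, X_synth])
    # label 1 = real observation, label 0 = synthetic observation
    y_all = np.concatenate(
        [np.ones(len(X), dtype=int), np.zeros(len(X_synth), dtype=int)]
    )
    return X_all, y_all


def rf_proximity(forest, X):
    """Fraction of trees in which each pair of rows lands in the same leaf."""
    leaves = forest.apply(X)  # leaf index per row and tree: (n_samples, n_trees)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        col = leaves[:, t]
        prox += col[:, None] == col[None, :]
    return prox / n_trees


rng = np.random.default_rng(0)
# Toy data standing in for the real inputs: two shifted Gaussian blobs.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 4)),
    rng.normal(4.0, 1.0, size=(30, 4)),
])

X_all, y_all = make_synthetic_class(X, rng)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_all, y_all)

prox = rf_proximity(forest, X)  # proximities among the real rows only
dist = np.sqrt(1.0 - prox)      # same transform as in the R snippet
```

Here `dist` is a symmetric matrix with zeros on the diagonal, so it can be handed to any k-medoids implementation that accepts a precomputed distance matrix. scikit-learn itself does not include PAM, so that step would need a separate package or custom code. The pairwise loop uses memory quadratic in the number of rows, which is fine for small data but may need rethinking for large inputs.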

Tags: python, cluster-analysis, random-forest

Michael Lindgren
