
Proximity Matrix in sklearn.ensemble.RandomForestClassifier

I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't find anything similar in the Python scikit-learn version of Random Forest. Does anyone know if there is an equivalent calculation for the Python version?

WtLgi asked Sep 09 '13 16:09


People also ask

How is the proximity matrix calculated in a random forest?

Proximities are calculated for each pair of cases/observations/sample points. If two cases occupy the same terminal node in a given tree, their proximity is increased by one. After all trees have been processed, the proximities are normalized by dividing by the number of trees.

What is the proximity matrix in a random forest?

Once a Random Forest has been trained, the proximity matrix quantifies sample similarity. The proximity between two samples is calculated by measuring the number of times that these two samples are placed in the same terminal node of the same tree of the RF, divided by the number of trees in the forest.
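
As a small illustration of the definition above, the sketch below computes proximities by hand for three samples and two trees. The leaf indices in `leaves` are made up for the example and do not come from a real forest.

```python
import numpy as np

# Hypothetical leaf indices: rows are samples, columns are trees.
leaves = np.array([
    [1, 4],
    [1, 5],
    [2, 5],
])

# same_leaf[i, j, t] is True when samples i and j share a leaf in tree t.
same_leaf = leaves[:, None, :] == leaves[None, :, :]

# Proximity is the fraction of trees in which each pair shares a leaf.
prox = same_leaf.mean(axis=2)
print(prox)
```

Samples 0 and 1 share a leaf only in the first tree, samples 1 and 2 only in the second, and samples 0 and 2 never do, so the off-diagonal proximities are 0.5, 0.5 and 0.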

When using the RandomForestClassifier from sklearn, which parameter sets the number of trees in the forest?

The n_estimators parameter sets the number of trees in the forest. Changed in version 0.22: the default value of n_estimators changed from 10 to 100. The sub-sample size used to build each tree is controlled separately, with the max_samples parameter if bootstrap=True (default); otherwise the whole dataset is used to build each tree. Read more in the User Guide.
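
A minimal sketch of how these two parameters might be set, assuming scikit-learn 0.22 or later (where max_samples exists). The values 50 and 0.5 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# n_estimators sets the number of trees; max_samples (used only when
# bootstrap=True) sets the fraction of samples drawn for each tree.
model = RandomForestClassifier(
    n_estimators=50, max_samples=0.5, bootstrap=True, random_state=0
)
model.fit(X, y)

n_trees = len(model.estimators_)   # one fitted tree per estimator
leaf_ids = model.apply(X)          # leaf index per sample and tree
print(n_trees, leaf_ids.shape)
```

The second dimension of the array returned by `apply` equals the number of trees, which is what a proximity calculation iterates over.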


1 Answer

Based on Gilles Louppe's answer, I have written a function. I don't know whether it is efficient, but it works. Best regards.

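import numpy as np

def proximityMatrix(model, X, normalize=True):
    # Leaf index of every sample in every tree: shape (n_samples, n_trees)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]

    # Count, tree by tree, how often each pair of samples shares a leaf
    a = terminals[:, 0]
    proxMat = 1 * np.equal.outer(a, a)

    for i in range(1, nTrees):
        a = terminals[:, i]
        proxMat += 1 * np.equal.outer(a, a)

    # Divide by the number of trees so that proximities lie in [0, 1]
    if normalize:
        proxMat = proxMat / nTrees

    return proxMat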

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
train = load_breast_cancer()

model = RandomForestClassifier(n_estimators=500, max_features=2, min_samples_leaf=40)
model.fit(train.data, train.target)
proximityMatrix(model, train.data, normalize=True)
## array([[ 1.   ,  0.414,  0.77 , ...,  0.146,  0.79 ,  0.002],
##        [ 0.414,  1.   ,  0.362, ...,  0.334,  0.296,  0.008],
##        [ 0.77 ,  0.362,  1.   , ...,  0.218,  0.856,  0.   ],
##        ..., 
##        [ 0.146,  0.334,  0.218, ...,  1.   ,  0.21 ,  0.028],
##        [ 0.79 ,  0.296,  0.856, ...,  0.21 ,  1.   ,  0.   ],
##        [ 0.002,  0.008,  0.   , ...,  0.028,  0.   ,  1.   ]])
Vyga answered Sep 21 '22 11:09
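
Since the question is about clustering, here is one possible way to use such a proximity matrix, offered as a sketch and not as part of the answer above. It assumes SciPy is available, treats 1 - proximity as a dissimilarity, and applies average-linkage hierarchical clustering; the choice of two clusters and of the linkage method is arbitrary. Note that this forest is trained on the class labels, so the proximities reflect those labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=40, random_state=0
)
model.fit(X, y)

# Proximity: fraction of trees in which two samples share a leaf.
leaves = model.apply(X)                      # (n_samples, n_trees)
prox = np.zeros((X.shape[0], X.shape[0]))
for k in range(leaves.shape[1]):
    col = leaves[:, k]
    prox += np.equal.outer(col, col)
prox /= leaves.shape[1]

# Turn the similarity into a dissimilarity and cluster it hierarchically.
dist = 1.0 - prox
condensed = squareform(dist, checks=False)   # condensed form for linkage
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels))
```

`fcluster` with `criterion="maxclust"` returns at most `t` clusters, labelled from 1, so the first entry printed by `np.bincount` is always 0.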