
Proximity Matrix - Random Forest , R

I am using the randomForest package in R, which can compute a proximity matrix (P). The package documentation describes it as: "if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes)."

I obtain the proximity matrix of a random forest as follows:

P <- randomForest(x, y, ntree = 1000, proximity=TRUE)$proximity

When I inspect the P matrix, I see values like P(i,j)=0.971014493, where i and j are two data instances in my training data set (x). Such a value does not make sense to me: when it is multiplied by 1000 (the number of trees in the forest), the result is not an integer, so it cannot be a "frequency" counted over 1000 trees. Could someone please help me understand why I get such real numbers in the proximity matrix?

asked May 20 '14 at 14:05 by banbar


People also ask

What is proximity in random forest in R?

The term "proximity" means the "closeness" or "nearness" between pairs of cases. Proximities are calculated for each pair of cases/observations/sample points. If two cases occupy the same terminal node through one tree, their proximity is increased by one.

How proximity matrix is calculated in random forest?

The proximity between two samples is calculated by measuring the number of times that these two samples are placed in the same terminal node of the same tree of RF, divided by the number of trees in the forest.

What is the use of proximity matrix in the Random Forest algorithm?

Proximities are used in replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data.


1 Answer

Because, just as with the default predictions, the default proximity is calculated using only the trees where neither observation was included in the sample used to build that tree (both were "out-of-bag").

The number of times this happens will vary slightly for each pair of cases, and certainly won't be a nice round number like 1000.

You'll note that the very next parameter listed after proximity is oob.prox, which indicates whether to use only out-of-bag pairs (the default) or each and every tree.
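To see the difference, here is a minimal sketch that fits the forest twice, once with the default and once with oob.prox = FALSE, and checks whether each proximity times ntree is a whole number. The built-in iris data, the seed, and the helper is_whole are assumptions made for illustration; they are not from the question. As a purely arithmetic illustration, a ratio such as 67/69 rounds to 0.971014493, so a value like the one in the question is what you would expect when the denominator is a per-pair tree count rather than 1000.

```r
library(randomForest)

set.seed(42)            # arbitrary seed, only so the run is repeatable
x <- iris[, 1:4]        # stand-in predictors (assumption: not the asker's data)
y <- iris$Species       # stand-in response
ntree <- 1000

# Default behaviour: oob.prox takes the value of proximity, i.e. TRUE here,
# so the denominator differs from pair to pair.
p_oob <- randomForest(x, y, ntree = ntree, proximity = TRUE)$proximity

# Use every tree for every pair: the denominator is ntree for all pairs.
p_all <- randomForest(x, y, ntree = ntree, proximity = TRUE,
                      oob.prox = FALSE)$proximity

# Hypothetical helper: TRUE where a value is a whole number up to rounding error.
is_whole <- function(v) abs(v - round(v)) < 1e-8

# Every entry of p_all is a count of trees divided by ntree.
print(all(is_whole(p_all * ntree)))

# Share of entries of p_oob that happen to be multiples of 1/ntree.
print(mean(is_whole(p_oob * ntree)))
```

With oob.prox = FALSE the first check should come out TRUE, which matches the "count out of 1000" reading in the question; with the default, many entries will generally not be multiples of 1/1000.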

answered Nov 10 '22 at 00:11 by joran