
Random forest output interpretation

I have run a random forest for my data and got the output in the form of a matrix. What are the rules it applied to classify?

P.S. I want a profile of the customer as output, e.g. Person from New York, works in the technology industry, etc.

How can I interpret the results from a random forest?

user2061730 asked Feb 21 '13 07:02




1 Answer

The "inTrees" R package might be useful.

Here is an example.

Extract raw rules from a random forest:

    library(inTrees)
    library(randomForest)

    data(iris)
    X <- iris[, 1:(ncol(iris) - 1)]  # X: predictors
    target <- iris[, "Species"]  # target: class
    rf <- randomForest(X, as.factor(target))
    treeList <- RF2List(rf)  # transform rf object to an inTrees' format
    exec <- extractRules(treeList, X)  # R-executable conditions
    exec[1:2, ]
    #      condition
    # [1,] "X[,1]<=5.45 & X[,4]<=0.8"
    # [2,] "X[,1]<=5.45 & X[,4]>0.8"

Measure the rules. len is the number of variable-value pairs in a condition, freq is the proportion of the data satisfying a condition, pred is the outcome of a rule (i.e., condition => pred), and err is the error rate of a rule.

    ruleMetric <- getRuleMetric(exec, X, target)  # get rule metrics
    ruleMetric[1:2, ]
    #      len  freq    err     condition                  pred
    # [1,] "2" "0.3"   "0"     "X[,1]<=5.45 & X[,4]<=0.8" "setosa"
    # [2,] "2" "0.047" "0.143" "X[,1]<=5.45 & X[,4]>0.8"  "versicolor"
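
To make the freq, pred and err columns concrete, the metrics for the first condition can be recomputed by hand in base R. This is only a sketch of what the numbers mean, not inTrees' internal code; the names `hit`, `freq`, `pred` and `err` are my own, and taking `pred` as the majority class among the matching rows is an assumption about how the rule outcome is chosen.

```r
data(iris)
X <- iris[, 1:4]             # predictors
target <- iris[, "Species"]  # class labels

# Evaluate the first condition from the output above on every row
hit <- X[, 1] <= 5.45 & X[, 4] <= 0.8

freq <- mean(hit)                             # share of rows satisfying the condition
pred <- names(which.max(table(target[hit])))  # assumed: majority class among those rows
err  <- mean(target[hit] != pred)             # share of matching rows the rule gets wrong

c(freq = freq, err = err)
pred
```

On iris this should match the first row of the table above (freq 0.3, err 0, pred setosa).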

Prune each rule:

    ruleMetric <- pruneRule(ruleMetric, X, target)
    ruleMetric[1:2, ]
    #      len  freq    err     condition                 pred
    # [1,] "1" "0.333" "0"     "X[,4]<=0.8"              "setosa"
    # [2,] "2" "0.047" "0.143" "X[,1]<=5.45 & X[,4]>0.8" "versicolor"

Select a compact rule set:

    (ruleMetric <- selectRuleRRF(ruleMetric, X, target))
    #      len freq    err     condition                                             pred         impRRF
    # [1,] "1" "0.333" "0"     "X[,4]<=0.8"                                          "setosa"     "1"
    # [2,] "3" "0.313" "0"     "X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65"               "versicolor" "0.806787615686919"
    # [3,] "4" "0.333" "0.04"  "X[,1]>4.95 & X[,3]<=5.35 & X[,4]>0.8 & X[,4]<=1.75"  "versicolor" "0.0746284932951366"
    # [4,] "2" "0.287" "0.023" "X[,1]<=5.9 & X[,2]>3.05"                             "setosa"     "0.0355855756152103"
    # [5,] "1" "0.307" "0.022" "X[,4]>1.75"                                          "virginica"  "0.0329176860493297"
    # [6,] "4" "0.027" "0"     "X[,1]>5.45 & X[,3]<=5.45 & X[,4]<=1.75 & X[,4]>1.55" "versicolor" "0.0234818254947883"
    # [7,] "3" "0.007" "0"     "X[,1]<=6.05 & X[,3]>5.05 & X[,4]<=1.7"               "versicolor" "0.0132907201116241"

Build an ordered rule list as a classifier:

    (learner <- buildLearner(ruleMetric, X, target))
    #      len freq                 err                  condition                                             pred
    # [1,] "1" "0.333333333333333"  "0"                  "X[,4]<=0.8"                                          "setosa"
    # [2,] "3" "0.313333333333333"  "0"                  "X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65"               "versicolor"
    # [3,] "4" "0.0133333333333333" "0"                  "X[,1]>5.45 & X[,3]<=5.45 & X[,4]<=1.75 & X[,4]>1.55" "versicolor"
    # [4,] "1" "0.34"               "0.0196078431372549" "X[,1]==X[,1]"                                        "virginica"
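
The ordered rule list is read top to bottom: each observation takes the prediction of the first rule whose condition it satisfies, and the last rule (`X[,1]==X[,1]`, which is always true) acts as the default. Below is a minimal base R sketch of that first-match logic, with the four rules copied from the output above. The `rules` data frame and the loop are illustrative only and are not how inTrees implements it; as far as I know the package ships its own `applyLearner` function for this step.

```r
data(iris)
X <- iris[, 1:4]
target <- iris[, "Species"]

# The four rules from the learner output above, in order
rules <- data.frame(
  condition = c("X[,4]<=0.8",
                "X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65",
                "X[,1]>5.45 & X[,3]<=5.45 & X[,4]<=1.75 & X[,4]>1.55",
                "X[,1]==X[,1]"),
  pred = c("setosa", "versicolor", "versicolor", "virginica"),
  stringsAsFactors = FALSE
)

pred <- rep(NA_character_, nrow(X))
for (i in seq_len(nrow(rules))) {
  hit <- eval(parse(text = rules$condition[i]))  # logical vector, one entry per row of X
  first_match <- is.na(pred) & hit               # rows not yet claimed by an earlier rule
  pred[first_match] <- rules$pred[i]
}

table(predicted = pred, actual = target)  # confusion matrix on the training data
```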

Make rules more readable:

    readableRules <- presentRules(ruleMetric, colnames(X))
    readableRules[1:2, ]
    #      len  freq    err     condition                                                   pred
    # [1,] "1" "0.333" "0"     "Petal.Width<=0.8"                                          "setosa"
    # [2,] "3" "0.313" "0"     "Petal.Length<=4.95 & Petal.Length>2.6 & Petal.Width<=1.65" "versicolor"
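
The readable form amounts to replacing each column reference `X[,k]` with the name of the k-th predictor. Here is a rough base R illustration of that substitution. The helper name `to_readable` is hypothetical, this is not the `presentRules` source, and it only handles the `X[,k]` pattern shown in this answer.

```r
# Replace every "X[,k]" in a condition string with the k-th column name
to_readable <- function(conditions, col_names) {
  for (k in seq_along(col_names)) {
    conditions <- gsub(paste0("X[,", k, "]"), col_names[k], conditions, fixed = TRUE)
  }
  conditions
}

data(iris)
conds <- c("X[,4]<=0.8", "X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65")
readable <- to_readable(conds, colnames(iris)[1:4])
readable
```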

Extract frequent variable interactions (note the rules are not pruned or selected):

    rf <- randomForest(X, as.factor(target))
    treeList <- RF2List(rf)  # transform rf object to an inTrees' format
    exec <- extractRules(treeList, X)  # R-executable conditions
    ruleMetric <- getRuleMetric(exec, X, target)  # get rule metrics
    freqPattern <- getFreqPattern(ruleMetric)
    # interactions of at least two predictor variables
    freqPattern[which(as.numeric(freqPattern[, "len"]) >= 2), ][1:4, ]
    #      len sup     conf    condition                  pred
    # [1,] "2" "0.045" "0.587" "X[,3]>2.45 & X[,4]<=1.75" "versicolor"
    # [2,] "2" "0.041" "0.63"  "X[,3]>4.75 & X[,4]>0.8"   "virginica"
    # [3,] "2" "0.039" "0.604" "X[,4]<=1.75 & X[,4]>0.8"  "versicolor"
    # [4,] "2" "0.033" "0.675" "X[,4]<=1.65 & X[,4]>0.8"  "versicolor"

One can also present these frequent patterns in a readable form using the presentRules function.

In addition, rules or frequent patterns can be formatted in LaTeX.

    library(xtable)
    # freqPatternSelect: the four two-variable patterns shown above
    freqPatternSelect <- freqPattern[which(as.numeric(freqPattern[, "len"]) >= 2), ][1:4, ]
    print(xtable(freqPatternSelect), include.rownames = FALSE)
    # \begin{table}[ht]
    # \centering
    # \begin{tabular}{lllll}
    #   \hline
    #   len & sup & conf & condition & pred \\
    #   \hline
    #   2 & 0.045 & 0.587 & X[,3]$>$2.45 \& X[,4]$<$=1.75 & versicolor \\
    #   2 & 0.041 & 0.63 & X[,3]$>$4.75 \& X[,4]$>$0.8 & virginica \\
    #   2 & 0.039 & 0.604 & X[,4]$<$=1.75 \& X[,4]$>$0.8 & versicolor \\
    #   2 & 0.033 & 0.675 & X[,4]$<$=1.65 \& X[,4]$>$0.8 & versicolor \\
    #   \hline
    # \end{tabular}
    # \end{table}

H.D. answered Sep 21 '22 03:09
