 

How to find which columns affect a prediction in R

Tags:

r

naivebayes

Say I am working on a machine learning model in R using naive Bayes. I would build the model with the naiveBayes function (from the e1071 package) as follows

model <- naiveBayes(Class ~ ., data = HouseVotes84)

I can also print out the model's parameters (the a-priori and conditional probabilities) by just printing the model.

I then make predictions as follows; with type = "raw" this returns the posterior probability of each class (with the default type = "class" it would return the predicted class itself):

predict(model, HouseVotes84[1:10,], type = "raw")

However, my question is: is there a way to see which of the columns affected this prediction the most? That way I could learn the most important factors contributing to, say, a student failing a class, if that were the response variable and the other columns were the predictors.

My question applies to any package in R; naiveBayes above is just an example.

asked Dec 06 '15 by saltandwater


1 Answer

The answer depends on how you want to do the feature selection.

If it is part of the model-building process and not some post-hoc analysis, you could use caret with its feature-selection wrapper methods to determine the best subset of features to model with, via recursive feature elimination, genetic algorithms, etc., or via filtering using univariate analysis.
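For example, here is a minimal sketch of recursive feature elimination with caret's rfe() on the HouseVotes84 data, using the built-in naive Bayes wrapper nbFuncs (which additionally needs the klaR package); the fold count and subset sizes below are purely illustrative:

library(caret)
library(mlbench)
data(HouseVotes84)

# drop rows with missing votes so the resampling runs cleanly
votes <- na.omit(HouseVotes84)

# recursive feature elimination with 5-fold CV, naive Bayes as the base model
ctrl <- rfeControl(functions = nbFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x = votes[, -1], y = votes$Class,
               sizes = c(4, 8, 12), rfeControl = ctrl)

rfe_fit               # best subset size and resampling profile
predictors(rfe_fit)   # the variables selected for that subset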


If it is part of your post-hoc analysis, based solely on your predictions, then it depends on the type of model you have used. caret also supports this, but only for compatible models!

For svm, with the exception of linear kernels, determining feature importance is highly non-trivial. I'm unaware of any general attempt at feature ranking for non-linear svm, regardless of language (please tell me if one exists!).
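For a linear kernel, though, a rough ranking is possible. Here is a small sketch with e1071::svm on a binary subset of iris (my own illustration, not anything built into caret or e1071): the weight vector can be recovered from the support vectors, and the size of each entry gives a crude measure of that feature's influence.

library(e1071)
data(iris)

# keep two classes so the single weight vector is easy to interpret
iris2 <- droplevels(subset(iris, Species != "setosa"))
svm_fit <- svm(Species ~ ., data = iris2, kernel = "linear")

# w = sum_i alpha_i * y_i * x_i, reconstructed from the support vectors
w <- t(svm_fit$coefs) %*% svm_fit$SV
sort(abs(drop(w)), decreasing = TRUE)   # larger |w_j| ~ more influential feature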

With rpart you can just look at the tree visually: the closer to the root a variable is used for a split, the more important it is. You can also get numeric importance scores via the caret package:

library(rpart)
library(caret)
# fit a classification tree on the kyphosis data shipped with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
# variable importance scores as computed by caret
caret::varImp(fit)
#        Overall
#Age    5.896114
#Number 3.411081
#Start  8.865279
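If you prefer the visual route, a quick sketch of plotting the same tree; the first split (here on Start) is the most influential, which matches the varImp scores above:

# base-graphics plot of the fitted tree
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)   # label splits and show class counts at the leaves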

With naiveBayes you can see it from your model output. You just have to stare really hard:

library(e1071)
data(HouseVotes84, package = "mlbench")
model <- naiveBayes(Class ~ ., data = HouseVotes84)
model
#
#Naive Bayes Classifier for Discrete Predictors
#
#Call:
#naiveBayes.default(x = X, y = Y, laplace = laplace)
#
#A-priori probabilities:
#Y
#  democrat republican 
# 0.6137931  0.3862069 
#
#Conditional probabilities:
#            V1
#Y                    n         y
#  democrat   0.3953488 0.6046512
#  republican 0.8121212 0.1878788
#
#            V2
#Y                    n         y
#  democrat   0.4979079 0.5020921
#  republican 0.4932432 0.5067568

Even a brief glance shows that V1 looks like a better variable than V2: the conditional probabilities for V1 differ sharply between democrats and republicans, while for V2 they are nearly identical across the two classes.
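If you want a number rather than eyeballing the tables, one crude heuristic (my own illustration, not something built into e1071) is to rank the yes/no predictors by how far apart their conditional probabilities are between the two classes:

# absolute gap in P(vote = "y" | class) between the parties;
# a larger gap means the variable separates the classes better
gap <- sapply(model$tables, function(tab) {
  abs(tab["democrat", "y"] - tab["republican", "y"])
})
sort(gap, decreasing = TRUE)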

answered Oct 10 '22 by chappers