I have constructed a decision tree for a dataset using rpart.
I divided the data into two parts, a training set and a test set, and built the tree on the training data. I now want to calculate the accuracy of the predictions made by that model.
My code is shown below:
library(rpart)
#reading the data
data = read.table("source")
names(data) <- c("a", "b", "c", "d", "class")
#generating test and train data - data selected randomly with an 80/20 split
trainIndex <- sample(1:nrow(data), floor(0.8 * nrow(data)))
train <- data[trainIndex,]
test <- data[-trainIndex,]
#tree construction based on information gain
tree = rpart(class ~ a + b + c + d, data = train, method = 'class', parms = list(split = "information"))
I now want to calculate the accuracy of the predictions generated by the model by comparing them against the actual values in the test data, but I am facing an error while doing so.
My code is shown below:
t_pred = predict(tree,test,type="class")
t = test['class']
accuracy = sum(t_pred == t)/length(t)
print(accuracy)
I get an error message that states:
Error in t_pred == t : comparison of these types is not implemented
In addition: Warning message:
Incompatible methods ("Ops.factor", "Ops.data.frame") for "=="
On checking the type of t_pred, I found that it is of type integer; however, the documentation (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/predict.rpart.html) states that the predict() method should return a vector. I do not understand why the type of this variable is integer rather than a list. Where have I made a mistake, and how can I fix it?
Accuracy can be computed by comparing the actual test-set values with the predicted values; once you have it, you can usually improve it further by tuning the parameters of the decision tree.
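The error in the question comes from t = test['class']: single-bracket indexing returns a one-column data frame, while predict(tree, test, type = "class") returns a factor, and R cannot compare a factor with a data frame (hence the Ops.factor / Ops.data.frame warning). Note also that length() of a data frame is its number of columns, so the original denominator would have been wrong even if the comparison had worked. A minimal sketch of the direct fix, reusing the tree, test and t_pred objects from the question:
#extract the actual class labels as a vector, not a data frame
t <- test$class          # or test[['class']]
#element-wise comparison now works; mean() gives the proportion correct
accuracy <- mean(t_pred == t)
print(accuracy)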
Alternatively, try calculating the confusion matrix first:
confMat <- table(test$class,t_pred)
Now you can calculate the accuracy by dividing the sum of the diagonal of the matrix, which counts the correct predictions, by the total sum of the matrix:
accuracy <- sum(diag(confMat))/sum(confMat)
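If you also want per-class statistics such as sensitivity and specificity alongside accuracy, the caret package wraps all of this in confusionMatrix(); a sketch, assuming caret is installed:
library(caret)   # assumes the caret package is available
#confusionMatrix() takes the predicted factor first and the reference labels second;
#the reference is coerced to a factor with the same levels as the predictions
cm <- confusionMatrix(data = t_pred,
                      reference = factor(test$class, levels = levels(t_pred)))
print(cm)        # prints the table, overall accuracy with a 95% CI, kappa and per-class stats
The printed output includes a confidence interval for the accuracy, which is useful when the test set is small.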