
rpart returns a tree with only the root node

Tags: r, rpart

In my dataset, Leakage takes two values, 1 and 0. Only about 300 rows have the value 1; the rest of the 569378 rows have the value 0. I suspect this imbalance is why the rpart result contains only a root node.

How can I solve this?

fm.pipe <- Leakage ~ PipeAge + PipePressure
CART.fit <- rpart(formula = fm.pipe, data = Data)

> printcp(CART.fit)

Regression tree:
rpart(formula = fm.pipe, data = Data)

Variables actually used in tree construction:
character(0)

Root node error: 299.84/569378 = 0.00052661

n= 569378 

         CP nsplit rel error xerror xstd
1 0.0033246      0         1      0    0
user3172776 asked Jan 08 '14 10:01




2 Answers

There may not be a way to "solve" this, if the independent variables do not provide enough information to grow the tree. See, for example, the help for rpart.control: "Any split that does not decrease the overall lack of fit by a factor of cp is not attempted." You could try loosening the control parameters, but there's no guarantee that will result in the tree growing beyond a root.

CART.fit <- rpart(formula=fm.pipe, data=Data, control=rpart.control(minsplit=2, minbucket=1, cp=0.001))
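
As a rough illustration of the effect of these control parameters, the sketch below uses a small synthetic dataset (not the asker's data; `toy`, `x`, and `y` are made-up names) with a rare 0/1 outcome and a weak predictor. Under these assumptions, the default controls should leave only the root, while the looser controls let the tree grow.

```r
library(rpart)

# Synthetic data: 30 ones among 10000 rows, slightly more frequent for large x.
n <- 10000
x <- seq_len(n)
y <- numeric(n)
y[c(seq(500, 5000, by = 500), seq(5250, 10000, by = 250))] <- 1
toy <- data.frame(y = y, x = x)

# Default controls: minsplit = 20, minbucket = 7, cp = 0.01.
fit_default <- rpart(y ~ x, data = toy)

# Loosened controls, as in the call above.
fit_loose <- rpart(y ~ x, data = toy,
                   control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.001))

# Each row of $frame is one node, so a single row means "root only".
nrow(fit_default$frame)
nrow(fit_loose$frame)
```

In the question's printcp output, the CP listed for the root is 0.0033246, which is below the default cp = 0.01 but above 0.001. That suggests lowering cp could allow at least one split there, although whether the resulting tree is useful is a separate matter.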
Jean V. Adams answered Nov 15 '22 18:11


I'm not sure I understand your row length issue, but here's what that error typically means:

rpart uses constraints to build a decision tree. Here are the default values, from the docs:

rpart.control(minsplit = 20, minbucket = round(minsplit/3), cp = 0.01, 
      maxcompete = 4, maxsurrogate = 5, usesurrogate = 2, xval = 10,
      surrogatestyle = 0, maxdepth = 30, ...)

You need to loosen these constraints. As @JeanVAdams said, start with the bare minimum:

rpart(formula=fm.pipe, data=Data, 
      control=rpart.control(minsplit=1, minbucket=1, cp=0.001))

Your first result will probably have far too many nodes, so you will have to gradually tighten these constraints until you get a reasonably sized tree.
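
An alternative to refitting with tighter controls is to grow the large tree once and cut it back with cost-complexity pruning via `prune()`. This is a different technique from the refitting described above, shown here only as a sketch. It uses the built-in `mtcars` data rather than the asker's data, and the pruning value 0.05 is an arbitrary choice for illustration.

```r
library(rpart)

# Grow a deliberately oversized tree with the loosest controls.
big <- rpart(mpg ~ ., data = mtcars,
             control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.001))

# The cp table lists, for each candidate tree size, the cp value that yields it.
printcp(big)

# Cut the tree back at a stricter cp without refitting from scratch.
small <- prune(big, cp = 0.05)

# Compare the number of nodes before and after pruning.
c(big = nrow(big$frame), small = nrow(small$frame))
```

Reading the xerror column of the cp table is one common way to pick the pruning value, but those numbers come from random cross-validation folds and will vary between runs unless a seed is set.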


If you're still confused, here's an example:

Let's say you are looking at grocery store data, and you want a tree of the most popular hours to shop. There are only 24 hours in a day, so if the data is summarized to one row per hour, the tree has only 24 observations to work with. rpart has a condition that says

"There must be at least 20 observations in a node for me to split it."

With 24 rows, the root can split at most once under the defaults: each child must keep at least 7 rows (minbucket), so neither child can reach 20. Note that minsplit and minbucket count rows, not distinct values of a predictor, so with many raw rows the cp threshold is usually the limit that stops the tree at the root. It's probably more complex than this, but it's a good place to start.

I actually was looking at this exact issue (shoppers by hour), and I had to leave my constraints at the lowest possible level in order to get a tree at all:

rpart(formula=fm.pipe, data=Data, control=rpart.control(minsplit=1, minbucket=1, cp=0.001))
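
The shoppers-by-hour situation can be sketched with a hypothetical table that has one row per hour. The counts below are made up, and `hourly` and `count_splits` are names introduced only for this illustration. With 24 rows, the default controls allow at most one split, while the loosened controls allow more.

```r
library(rpart)

# Hypothetical counts, one row per hour of the day (made-up numbers).
hourly <- data.frame(
  hour = 0:23,
  shoppers = c(5, 3, 2, 2, 3, 8, 20, 45, 60, 70, 85, 95,
               110, 100, 90, 95, 105, 120, 115, 90, 60, 40, 20, 10)
)

# Helper: count internal (non-leaf) nodes, i.e. the number of splits.
count_splits <- function(fit) sum(fit$frame$var != "<leaf>")

# Defaults: minsplit = 20 and minbucket = 7 leave room for one split at most.
fit_default <- rpart(shoppers ~ hour, data = hourly)

# Loosened controls let the tree keep splitting the small nodes.
fit_loose <- rpart(shoppers ~ hour, data = hourly,
                   control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.001))

count_splits(fit_default)
count_splits(fit_loose)
```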

Travis Heeter answered Nov 15 '22 18:11