R - how to use rpart?

Tags:

r

I cannot get much information out of rpart.

I have a data frame:

a = structure(list(V1 = c(2, 3, 4, 2, 3, 2, 3, 3, 5, 3), V2 = c(15, 
26, 94, 15, 26, 33, 33, 33, 5, 15), V3 = structure(c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("f", "t"), class = "factor")), .Names = c("V1", 
"V2", "V3"), row.names = c(NA, -10L), class = "data.frame")

> a
   V1 V2 V3
1   2 15  f
2   3 26  f
3   4 94  f
4   2 15  f
5   3 26  f
6   2 33  f
7   3 33  f
8   3 33  t
9   5  5  t
10  3 15  t

> rpart(V3 ~ ., data=a)
n= 10 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 10 3 f (0.7000000 0.3000000) *

Why is rpart not providing more information, for example the fact that there were three cases of (V1 == 2) all leading to response = "f" (lines 1,4,6)?

In essence, I would like to find out:

  • what tests were run by rpart before it gave me the output above?
  • did rpart include a test (V2 == 2) -> response statistics, and if not, how can I make it include such a test and result?

I have read the rpart vignette, but I could not find the answers there.

Timothée HENRY asked Apr 30 '14 15:04


2 Answers

The short answer to your question might be this:

rpart(V3 ~ V1 + V2, data = a, control = rpart.control(minsplit = 5))

So you might want to spend some time reading the documentation, with particular emphasis on rpart.control: the default minsplit is 20, so a node holding only 10 observations is never considered for splitting. More broadly, note that rpart is still not "testing" a split based on the criterion V2 == 2, simply because that variable is continuous. All splits on continuous variables are simple binary inequality splits. Only factors are split according to a selection of a subset of levels.
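A minimal sketch of the point above, assuming the data frame a from the question (rebuilt here with data.frame() so the block is self-contained). The names fit_default and fit_small are illustrative only. It compares the default control settings with a lowered minsplit, then calls summary(), which lists the primary and competitor splits considered at each node:

```r
library(rpart)

# Data from the question, rebuilt with data.frame() for readability.
a <- data.frame(
  V1 = c(2, 3, 4, 2, 3, 2, 3, 3, 5, 3),
  V2 = c(15, 26, 94, 15, 26, 33, 33, 33, 5, 15),
  V3 = factor(c("f", "f", "f", "f", "f", "f", "f", "t", "t", "t"))
)

# Default control: minsplit = 20, so a root node with 10 rows is not split.
fit_default <- rpart(V3 ~ V1 + V2, data = a)
nrow(fit_default$frame)   # one row per node; only the root here

# Lower minsplit so that the root becomes eligible for splitting.
fit_small <- rpart(V3 ~ V1 + V2, data = a,
                   control = rpart.control(minsplit = 5))
print(fit_small)

# summary() reports, per node, the chosen split and its competitors,
# each written as an inequality because V1 and V2 are numeric.
summary(fit_small)
```

Note that lowering minsplit also lowers the default minbucket (a third of minsplit, rounded), so very small leaves become possible; on ten rows the resulting tree should be read as an illustration rather than a model.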

The vignette you linked to contains an extensive discussion (with citations to further discussions) on precisely what process the function went through to choose the splits it did, so I'm not sure how to respond to your claim that you read it but found no answers.

joran answered Sep 27 '22 23:09


If you start with ?rpart and follow the info under Value, you'll go to ?rpart.object, which will tell you:

frame data frame with one row for each node in the tree. The row.names of frame contain the (unique) node numbers that follow a binary ordering indexed by node depth. Columns of frame include var, a factor giving the names of the variables used in the split at each node (leaf nodes are denoted by the level "<leaf>"), n, the number of observations reaching the node, wt, the sum of case weights for observations reaching the node, dev, the deviance of the node, yval, the fitted value of the response at the node, and splits, a two column matrix of left and right split labels for each node. Also included in the frame are complexity, the complexity parameter at which this split will collapse, ncompete, the number of competitor splits recorded, and nsurrogate, the number of surrogate splits recorded.

Extra response information which may be present is in yval2, which contains the number of events at the node (poisson tree), or a matrix containing the fitted class, the class counts for each node, the class probabilities and the ‘node probability’ (classification trees).

where an integer vector of the same length as the number of observations in the root node, containing the row number of frame corresponding to the leaf node that each observation falls into.

call an image of the call that produced the object, but with the arguments all named and with the actual formula included as the formula argument. To re-evaluate the call, say update(tree).

terms an object of class c("terms", "formula") (see terms.object) summarizing the formula. Used by various methods, but typically not of direct relevance to users.

splits a numeric matrix describing the splits: only present if there are any. The row label is the name of the split variable, and columns are count, the number of observations (which are not missing and are of positive weight) sent left or right by the split (for competitor splits this is the number that would have been sent left or right had this split been used, for surrogate splits it is the number missing the primary split variable which were decided using this surrogate), ncat, the number of categories or levels for the variable (+/-1 for a continuous variable), improve, which is the improvement in deviance given by this split, or, for surrogates, the concordance of the surrogate with the primary, and index, the numeric split point. The last column adj gives the adjusted concordance for surrogate splits. For a factor, the index column contains the row number of the csplit matrix. For a continuous variable, the sign of ncat determines whether the subset x < cutpoint or x > cutpoint is sent to the left.

csplit an integer matrix. (Only present if at least one of the split variables is a factor or ordered factor.) There is a row for each such split, and the number of columns is the largest number of levels in the factors. Which row is given by the index column of the splits matrix. The columns record 1 if that level of the factor goes to the left, 3 if it goes to the right, and 2 if that level is not present at this node of the tree (or not defined for the factor).

method character string: the method used to grow the tree. One of "class", "exp", "poisson", "anova" or "user" (if splitting functions were supplied).

cptable a matrix of information on the optimal prunings based on a complexity parameter.

variable.importance a named numeric vector giving the importance of each variable. (Only present if there are any splits.) When printed by summary.rpart these are rescaled to add to 100.

numresp integer number of responses; the number of levels for a factor response.

parms, control a record of the arguments supplied, with defaults filled in.

functions the summary, print and text functions for method used.

ordered a named logical vector recording for each variable if it was an ordered factor.

na.action (where relevant) information returned by model.frame on the special handling of NAs derived from the na.action argument.
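As a rough illustration of the components listed above, the following sketch fits a tree on the data from the question and pulls out a few of them. It assumes the minsplit = 5 setting from the other answer so that the tree has splits at all; the names fit and leaf_node are made up for this example.

```r
library(rpart)

# Data from the question, rebuilt with data.frame() for readability.
a <- data.frame(
  V1 = c(2, 3, 4, 2, 3, 2, 3, 3, 5, 3),
  V2 = c(15, 26, 94, 15, 26, 33, 33, 33, 5, 15),
  V3 = factor(c("f", "f", "f", "f", "f", "f", "f", "t", "t", "t"))
)

fit <- rpart(V3 ~ V1 + V2, data = a,
             control = rpart.control(minsplit = 5))

# One row per node; the row names are the node numbers.
fit$frame[, c("var", "n", "dev", "yval", "complexity")]

# Primary, competitor and surrogate splits, one row per recorded split.
fit$splits

# `where` holds row indices into `frame`, not node numbers,
# so translate through the row names to get the leaf node of each case.
leaf_node <- as.integer(rownames(fit$frame))[fit$where]
table(leaf = leaf_node, response = a$V3)

# Pruning table and variable importance.
fit$cptable
fit$variable.importance
```

The cross-tabulation at the end is one way to get the kind of "which cases ended up where, and with what response" statistics the question asks about.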

Carl Witthoft answered Sep 27 '22 22:09