result of rpart is a root, but data shows Information Gain

Tags:

I have a dataset with an event rate of less than 3% (i.e. there are about 700 records with class 1 and 27000 records with class 0).

ID          V1  V2      V3  V5      V6  Target
SDataID3    161 ONE     1   FOUR    0   0
SDataID4    11  TWO     2   THREE   2   1
SDataID5    32  TWO     2   FOUR    2   0
SDataID7    13  ONE     1   THREE   2   0
SDataID8    194 TWO     2   FOUR    0   0
SDataID10   63  THREE   3   FOUR    0   1
SDataID11   89  ONE     1   FOUR    0   0
SDataID13   78  TWO     2   FOUR    0   0
SDataID14   87  TWO     2   THREE   1   0
SDataID15   81  ONE     1   THREE   0   0
SDataID16   63  ONE     3   FOUR    0   0
SDataID17   198 ONE     3   THREE   0   0
SDataID18   9   TWO     3   THREE   0   0
SDataID19   196 ONE     2   THREE   2   0
SDataID20   189 TWO     2   ONE     1   0
SDataID21   116 THREE   3   TWO     0   0
SDataID24   104 ONE     1   FOUR    0   0
SDataID25   5   ONE     2   ONE     3   0
SDataID28   173 TWO     3   FOUR    0   0
SDataID29   5   ONE     3   ONE     3   0
SDataID31   87  ONE     3   FOUR    3   0
SDataID32   5   ONE     2   THREE   1   0
SDataID34   45  ONE     1   FOUR    0   0
SDataID35   19  TWO     2   THREE   0   0
SDataID37   133 TWO     2   FOUR    0   0
SDataID38   8   ONE     1   THREE   0   0
SDataID39   42  ONE     1   THREE   0   0
SDataID43   45  ONE     1   THREE   1   0
SDataID44   45  ONE     1   FOUR    0   0
SDataID45   176 ONE     1   FOUR    0   0
SDataID46   63  ONE     1   THREE   3   0

I am trying to find out split using the decision tree. But result of tree is only 1 root.

> library(rpart)
> tree <- rpart(Target ~ ., data=subset(train, select=c( -Record.ID) ),method="class")
> printcp(tree)

Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), method = "class")

Variables actually used in tree construction:
character(0)

Root node error: 749/18239 = 0.041066

n= 18239 

  CP nsplit rel error xerror xstd
1  0      0         1      0    0

After reading most of the resources on StackOverflow, I loosened/tweaked the control parameters which gave me the desired decision tree.

> tree <- rpart(Target ~ ., data=subset(train, select=c( -Record.ID) ),method="class" ,control =rpart.control(minsplit = 1,minbucket=2, cp=0.00002))
> printcp(tree)

Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), 
    method = "class", control = rpart.control(minsplit = 1, minbucket = 2, 
        cp = 2e-05))

Variables actually used in tree construction:
[1] V5         V2                     V1          
[4] V3         V6

Root node error: 749/18239 = 0.041066

n= 18239 

          CP nsplit rel error xerror     xstd
1 0.00024275      0   1.00000 1.0000 0.035781
2 0.00019073     20   0.99466 1.0267 0.036235
3 0.00016689     34   0.99199 1.0307 0.036302
4 0.00014835     54   0.98798 1.0334 0.036347
5 0.00002000     63   0.98665 1.0427 0.036504

When I pruned the tree it resulted in a tree with a single node.

> pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
> printcp(pruned.tree)

Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), 
    method = "class", control = rpart.control(minsplit = 1, minbucket = 2, 
        cp = 2e-05))

Variables actually used in tree construction:
character(0)

Root node error: 749/18239 = 0.041066

n= 18239 

          CP nsplit rel error xerror     xstd
1 0.00024275      0         1      1 0.035781

The tree should not be giving out only root node because mathematically, on a given node (example provided) we are getting Information Gain. I don't know if I am making a mistake by pruning or there is an issue with rpart in handling low event rate dataset?

NODE    p       1-p     Entropy         Weights         Ent*Weight      # Obs
Node 1  0.032   0.968   0.204324671     0.351398601     0.071799404     10653
Node 2  0.05    0.95    0.286396957     0.648601399     0.185757467     19663

Sum(Ent*wght)       0.257556871 
Information gain    0.742443129

202

asked Oct 31 '17 06:10

Rachit Jain

1 Answers

The data you provided does not reflect the ratio of the two target classes, so I've tweaked the data to better reflect that (see Data section):

> prop.table(table(train$Target))

         0          1 
0.96707581 0.03292419 

> 700/27700
[1] 0.02527076

The ratios are now relatively close...

library(rpart)
tree <- rpart(Target ~ ., data=train, method="class")
printcp(tree)

Results in:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class")

Variables actually used in tree construction:
character(0)

Root node error: 912/27700 = 0.032924

n= 27700 

  CP nsplit rel error xerror xstd
1  0      0         1      0    0

Now, the reason that you are seeing only the root node for your first model, is probably due to the fact that you have extremely imbalanced target classes, and so, your independent variables could not provide enough information to grow the tree. My sample data has 3.3% event rate, but yours has only about 2.5%!

As you have mentioned, there is a way to force rpart to grow the tree. That is to override the default complexity parameter (cp). The complexity measure is a combination of the size of the tree and how well the tree separates the target classes. From ?rpart.control, "Any split that does not decrease the overall lack of fit by a factor of cp is not attempted". This means that your model at this point does not have a split beyond the root node that decreases the complexity level enough for rpart to take into consideration. We can relax this threshold of what is considered "enough" by either setting a low or a negative cp (negative cp basically forces the tree to grow to its full size).

tree <- rpart(Target ~ ., data=train, method="class" ,parms = list(split = 'information'), 
              control =rpart.control(minsplit = 1,minbucket=2, cp=0.00002))
printcp(tree)

Results in:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"), 
    control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))

Variables actually used in tree construction:
[1] ID V1 V2 V3 V5 V6

Root node error: 912/27700 = 0.032924

n= 27700 

           CP nsplit rel error xerror     xstd
1  4.1118e-04      0   1.00000 1.0000 0.032564
2  3.6550e-04     30   0.98355 1.0285 0.033009
3  3.2489e-04     45   0.97807 1.0702 0.033647
4  3.1328e-04    106   0.95504 1.0877 0.033911
5  2.7412e-04    116   0.95175 1.1031 0.034141
6  2.5304e-04    132   0.94737 1.1217 0.034417
7  2.1930e-04    149   0.94298 1.1458 0.034771
8  1.9936e-04    159   0.94079 1.1502 0.034835
9  1.8275e-04    181   0.93640 1.1645 0.035041
10 1.6447e-04    193   0.93421 1.1864 0.035356
11 1.5664e-04    233   0.92654 1.1853 0.035341
12 1.3706e-04    320   0.91228 1.2083 0.035668
13 1.2183e-04    344   0.90899 1.2127 0.035730
14 9.9681e-05    353   0.90789 1.2237 0.035885
15 2.0000e-05    364   0.90680 1.2259 0.035915

As you can see, the tree has grown to a size that reduces the complexity level by a minimum of cp. Two things to note:

At zero nsplit, CP is already as low as 0.0004, where as the default cp in rpart is set to 0.01.
Starting from nsplit == 0, the cross validation error (xerror) increases as you increase the number of splits.

Both of these indicate that your model is overfitting the data at nsplit == 0 and beyond, since adding more independent variables into your model does not add enough information (insufficient reduction in CP) to reduce the cross validation error. With this being said, your root node model is the best model in this case, which explains why your initial model has only the root node.

pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
printcp(pruned.tree)

Results in:

Classification tree:
rpart(formula = Target ~ ., data = train, method = "class", parms = list(split = "information"), 
    control = rpart.control(minsplit = 1, minbucket = 2, cp = 2e-05))

Variables actually used in tree construction:
character(0)

Root node error: 912/27700 = 0.032924

n= 27700 

          CP nsplit rel error xerror     xstd
1 0.00041118      0         1      1 0.032564

As for the pruning part, it is now clearer why your pruned tree is the root node tree, since a tree that goes beyond 0 splits has increasing cross validation error. Taking the tree with the minimum xerror would leave you with root node tree as expected.

Information gain basically tells you how much "information" is added for each split. So technically, every split has some degree of information gain since you are adding more variables into your model (information gain is always non-negative). What you should think about is whether that additional gain (or no gain) reduces the errors enough for you to warrant a more complex model. Hence, the tradeoff between bias and variance.

In this case, it doesn't really make sense for you to reduce cp and later prune the resulting tree. since by setting a low cp, you are telling rpart to make splits even if it overfits, while pruning "cuts" all the nodes that overfits.

Data:

Note that I am shuffling the rows for each column and sample instead of sampling the row indices. This is because the data you provided is probably not a random sample of your original dataset (likely biased), so I am basically randomly creating new observations with combinations of your existing rows which would hopefully reduce that bias.

init_train = structure(list(ID = structure(c(16L, 24L, 29L, 30L, 31L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
17L, 18L, 19L, 20L, 21L, 22L, 23L, 25L, 26L, 27L, 28L), .Label = c("SDataID10", 
"SDataID11", "SDataID13", "SDataID14", "SDataID15", "SDataID16", 
"SDataID17", "SDataID18", "SDataID19", "SDataID20", "SDataID21", 
"SDataID24", "SDataID25", "SDataID28", "SDataID29", "SDataID3", 
"SDataID31", "SDataID32", "SDataID34", "SDataID35", "SDataID37", 
"SDataID38", "SDataID39", "SDataID4", "SDataID43", "SDataID44", 
"SDataID45", "SDataID46", "SDataID5", "SDataID7", "SDataID8"), class = "factor"), 
    V1 = c(161L, 11L, 32L, 13L, 194L, 63L, 89L, 78L, 87L, 81L, 
    63L, 198L, 9L, 196L, 189L, 116L, 104L, 5L, 173L, 5L, 87L, 
    5L, 45L, 19L, 133L, 8L, 42L, 45L, 45L, 176L, 63L), V2 = structure(c(1L, 
    3L, 3L, 1L, 3L, 2L, 1L, 3L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 2L, 
    1L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = c("ONE", "THREE", "TWO"), class = "factor"), 
    V3 = c(1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 
    2L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L), V5 = structure(c(1L, 3L, 1L, 3L, 1L, 1L, 1L, 
    1L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 1L, 2L, 1L, 2L, 1L, 3L, 
    1L, 3L, 1L, 3L, 3L, 3L, 1L, 1L, 3L), .Label = c("FOUR", "ONE", 
    "THREE", "TWO"), class = "factor"), V6 = c(0L, 2L, 2L, 2L, 
    0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 0L, 3L, 0L, 
    3L, 3L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 3L), Target = c(0L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
    )), .Names = c("ID", "V1", "V2", "V3", "V5", "V6", "Target"
), class = "data.frame", row.names = c(NA, -31L))

set.seed(1000)
train = as.data.frame(lapply(init_train, function(x) sample(x, 27700, replace = TRUE)))

140

answered Nov 15 '22 19:11

acylam

Related questions
                            
                                Extending dplyr and use of internal functions
                            
                                R join on like/grep condition
                            
                                How to expand Posixct field in R str()?
                            
                                Wordcloud is cropping text
                            
                                Scraping NBA data in R with rjson
                            
                                Plotting streets with ggmap and geom_path
                            
                                Embed Google analytics tracking code in shiny app
                            
                                Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class Date
                            
                                Base word stemming instead of root word stemming in R
                            
                                How do I escape double-brackets [[ in-line in org-mode
                            
                                ggplot2 polar plot axis label location
                            
                                Produce handouts of slidify or Rpres HTML5 presentations
                            
                                Can you "tie", or provide an alternative implementation, of a data.frame in R?
                            
                                Alternative to do.call for large datasets
                            
                                Datetime formatting in R3.0.3 and R3.1.3
                            
                                RODBC ERROR: Could not SQLExecDirect in mysql
                            
                                Embed shiny app into a jupyter notebook
                            
                                .Internal(La_rs(x,FALSE)) crashes R after long (reproducible) script on windows only
                            
                                Package has a VignetteBuilder field but no prebuilt vignette index
                            
                                Controlling placement of figures with the knitr::include_graphics function

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

result of rpart is a root, but data shows Information Gain

Tags:

r

machine-learning

decision-tree

rpart

information-gain

Rachit Jain

People also ask

1 Answers

acylam

Recent Activity

Donate For Us