I'm training a model in R with the caret package:
ctrl <- trainControl(method = "repeatedcv", repeats = 3, summaryFunction = twoClassSummary)
logitBoostFit <- train(LoanStatus~., credit, method = "LogitBoost", family=binomial, preProcess=c("center", "scale", "pca"),
trControl = ctrl)
I'm getting the following warnings:
Warning message:
In train.default(x, y, weights = w, ...): The metric "Accuracy" was not in the result set. ROC will be used instead.
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.03496 Min. :0.9747
1st Qu.: NA 1st Qu.:0.03919 1st Qu.:0.9758
Median : NA Median :0.04343 Median :0.9770
Mean :NaN Mean :0.04349 Mean :0.9779
3rd Qu.: NA 3rd Qu.:0.04776 3rd Qu.:0.9795
Max. : NA Max. :0.05210 Max. :0.9821
NA's :3
Error in train.default(x, y, weights = w, ...): Stopping
I installed the pROC package:
install.packages("pROC", repos="http://cran.rstudio.com/")
library(pROC)
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’:
cov, smooth, var
Here's the data:
str(credit)
'data.frame': 8580 obs. of 45 variables:
$ ListingCategory : int 1 7 3 1 1 7 1 1 1 1 ...
$ IncomeRange : int 3 4 6 4 4 3 3 4 3 3 ...
$ StatedMonthlyIncome : num 2583 4326 10500 4167 5667 ...
$ IncomeVerifiable : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ DTIwProsperLoan : num 1.8e-01 2.0e-01 1.7e-01 1.0e+06 1.8e-01 4.4e-01 2.2e-01 2.0e-01 2.0e-01 3.1e-01 ...
$ EmploymentStatusDescription: Factor w/ 7 levels "Employed","Full-time",..: 1 4 1 7 1 1 1 1 1 1 ...
$ Occupation : Factor w/ 65 levels "","Accountant/CPA",..: 37 37 20 14 43 58 48 37 37 37 ...
$ MonthsEmployed : int 4 44 159 67 26 16 209 147 24 9 ...
$ BorrowerState : Factor w/ 48 levels "AK","AL","AR",..: 22 32 5 5 14 28 4 10 10 34 ...
$ BorrowerCity : Factor w/ 3089 levels "AARONSBURG","ABERDEEN",..: 1737 3059 2488 654 482 719 895 1699 2747 1903 ...
$ BorrowerMetropolitanArea : Factor w/ 1 level "(Not Implemented)": 1 1 1 1 1 1 1 1 1 1 ...
$ LenderIndicator : int 0 0 0 1 0 0 0 0 1 0 ...
$ GroupIndicator : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ GroupName : Factor w/ 83 levels "","00 Used Car Loans",..: 1 1 1 47 1 1 1 1 1 1 ...
$ ChannelCode : int 90000 90000 90000 80000 40000 40000 90000 90000 80000 90000 ...
$ AmountParticipation : int 0 0 0 0 0 0 0 0 0 0 ...
$ MonthlyDebt : int 247 785 1631 817 644 1524 427 817 654 749 ...
$ CurrentDelinquencies : int 0 0 0 0 0 0 0 1 0 1 ...
$ DelinquenciesLast7Years : int 0 10 0 0 0 0 0 0 0 0 ...
$ PublicRecordsLast10Years : int 0 1 0 0 0 0 1 0 1 0 ...
$ PublicRecordsLast12Months : int 0 0 0 0 0 0 0 0 0 0 ...
$ FirstRecordedCreditLine : Factor w/ 4719 levels "1/1/00 0:00",..: 3032 2673 1197 2541 4698 4345 3150 925 4452 2358 ...
$ CreditLinesLast7Years : int 53 30 36 26 7 22 15 20 34 32 ...
$ InquiriesLast6Months : int 2 8 5 0 0 0 0 3 0 0 ...
$ AmountDelinquent : int 0 0 0 0 0 0 0 63 0 15 ...
$ CurrentCreditLines : int 10 10 18 10 4 11 6 10 7 8 ...
$ OpenCreditLines : int 9 10 15 8 3 8 5 7 7 8 ...
$ BankcardUtilization : num 0.26 0.69 0.94 0.69 0.81 0.38 0.55 0.24 0.03 0 ...
$ TotalOpenRevolvingAccounts : int 9 7 12 10 3 5 4 5 4 6 ...
$ InstallmentBalance : int 48648 14827 0 0 0 30916 0 21619 41340 15447 ...
$ RealEstateBalance : int 0 0 577745 0 0 0 191296 0 0 126039 ...
$ RevolvingBalance : int 5265 9967 94966 50511 37871 22463 19550 2436 1223 3236 ...
$ RealEstatePayment : int 0 0 4159 0 0 0 1303 0 0 1279 ...
$ RevolvingAvailablePercent : int 78 52 36 45 18 61 44 74 96 76 ...
$ TotalInquiries : int 8 11 15 2 0 0 1 7 1 1 ...
$ TotalTradeItems : int 53 30 36 26 7 22 15 20 34 32 ...
$ SatisfactoryAccounts : int 52 23 36 26 7 19 15 18 34 29 ...
$ NowDelinquentDerog : int 0 0 0 0 0 0 0 1 0 1 ...
$ WasDelinquentDerog : int 1 7 0 0 0 3 0 1 0 2 ...
$ OldestTradeOpenDate : int 5092001 5011977 12011984 4272000 9081993 9122000 6161987 11181999 9191990 4132000 ...
$ DelinquenciesOver30Days : int 0 6 0 0 0 13 0 2 0 2 ...
$ DelinquenciesOver60Days : int 0 4 0 0 0 0 0 0 0 1 ...
$ DelinquenciesOver90Days : int 0 10 0 0 0 0 0 0 0 0 ...
$ IsHomeowner : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ LoanStatus : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 2 1 ...
summary(credit)
ListingCategory IncomeRange StatedMonthlyIncome IncomeVerifiable
Min. : 0.000 Min. :1.000 Min. : 0 Mode :logical
1st Qu.: 1.000 1st Qu.:3.000 1st Qu.: 3167 FALSE:784
Median : 2.000 Median :4.000 Median : 4750 TRUE :7796
Mean : 4.997 Mean :4.089 Mean : 5755 NA's :0
3rd Qu.: 7.000 3rd Qu.:5.000 3rd Qu.: 7083
Max. :20.000 Max. :7.000 Max. :250000
DTIwProsperLoan EmploymentStatusDescription MonthsEmployed
Min. : 0.0 Employed :7182 Min. :-23.00
1st Qu.: 0.1 Full-time : 416 1st Qu.: 26.00
Median : 0.2 Not employed : 122 Median : 68.00
Mean : 91609.4 Other : 475 Mean : 97.44
3rd Qu.: 0.3 Part-time : 7 3rd Qu.:139.00
Max. :1000000.0 Retired : 32 Max. :755.00
Self-employed: 346 NA's :5
BorrowerState LenderIndicator GroupIndicator ChannelCode
CA :1056 Min. :0.00000 Mode :logical Min. :40000
FL : 608 1st Qu.:0.00000 FALSE:8325 1st Qu.:80000
NY : 574 Median :0.00000 TRUE :255 Median :80000
TX : 532 Mean :0.09196 NA's :0 Mean :77196
IL : 443 3rd Qu.:0.00000 3rd Qu.:90000
GA : 343 Max. :1.00000 Max. :90000
(Other):5024
MonthlyDebt CurrentDelinquencies DelinquenciesLast7Years
Min. : 0.0 Min. : 0.0000 Min. : 0.000
1st Qu.: 364.0 1st Qu.: 0.0000 1st Qu.: 0.000
Median : 708.0 Median : 0.0000 Median : 0.000
Mean : 885.5 Mean : 0.4119 Mean : 4.009
3rd Qu.: 1205.2 3rd Qu.: 0.0000 3rd Qu.: 3.000
Max. :30213.0 Max. :21.0000 Max. :99.000
PublicRecordsLast10Years PublicRecordsLast12Months CreditLinesLast7Years
Min. : 0.0000 Min. :0.00000 Min. : 2.0
1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.: 16.0
Median : 0.0000 Median :0.00000 Median : 24.0
Mean : 0.2809 Mean :0.01364 Mean : 26.1
3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.: 34.0
Max. :11.0000 Max. :4.00000 Max. :115.0
InquiriesLast6Months AmountDelinquent CurrentCreditLines OpenCreditLines
Min. : 0.0000 Min. : 0 Min. : 0.000 Min. : 0.000
1st Qu.: 0.0000 1st Qu.: 0 1st Qu.: 5.000 1st Qu.: 5.000
Median : 1.0000 Median : 0 Median : 9.000 Median : 8.000
Mean : 0.9994 Mean : 1195 Mean : 9.345 Mean : 8.306
3rd Qu.: 1.0000 3rd Qu.: 0 3rd Qu.:12.000 3rd Qu.:11.000
Max. :15.0000 Max. :179158 Max. :54.000 Max. :42.000
BankcardUtilization TotalOpenRevolvingAccounts InstallmentBalance
Min. :0.0000 Min. : 0.000 Min. : 0
1st Qu.:0.2500 1st Qu.: 3.000 1st Qu.: 3338
Median :0.5400 Median : 6.000 Median : 14453
Mean :0.5182 Mean : 6.441 Mean : 24900
3rd Qu.:0.7900 3rd Qu.: 9.000 3rd Qu.: 32238
Max. :2.2300 Max. :44.000 Max. :739371
NA's :328
RealEstateBalance RevolvingBalance RealEstatePayment RevolvingAvailablePercent
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.00
1st Qu.: 0 1st Qu.: 2799 1st Qu.: 0.0 1st Qu.: 29.00
Median : 26154 Median : 8784 Median : 346.5 Median : 52.00
Mean : 109306 Mean : 19555 Mean : 830.5 Mean : 51.46
3rd Qu.: 176542 3rd Qu.: 21110 3rd Qu.: 1382.2 3rd Qu.: 75.00
Max. :1938421 Max. :695648 Max. :13651.0 Max. :100.00
TotalInquiries TotalTradeItems SatisfactoryAccounts NowDelinquentDerog
Min. : 0.00 Min. : 2.0 Min. : 1.00 Min. : 0.0000
1st Qu.: 2.00 1st Qu.: 16.0 1st Qu.: 14.00 1st Qu.: 0.0000
Median : 3.00 Median : 24.0 Median : 21.00 Median : 0.0000
Mean : 3.91 Mean : 26.1 Mean : 23.34 Mean : 0.4119
3rd Qu.: 5.00 3rd Qu.: 34.0 3rd Qu.: 30.25 3rd Qu.: 0.0000
Max. :36.00 Max. :115.0 Max. :113.00 Max. :21.0000
WasDelinquentDerog OldestTradeOpenDate DelinquenciesOver30Days
Min. : 0.000 Min. : 1011957 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 4101996 1st Qu.: 0.000
Median : 1.000 Median : 7191993 Median : 1.000
Mean : 2.343 Mean : 6934230 Mean : 4.332
3rd Qu.: 3.000 3rd Qu.:10011990 3rd Qu.: 5.000
Max. :32.000 Max. :12312004 Max. :99.000
DelinquenciesOver60Days DelinquenciesOver90Days IsHomeowner LoanStatus
Min. : 0.000 Min. : 0.000 Mode :logical 0:1518
1st Qu.: 0.000 1st Qu.: 0.000 FALSE:4264 1:7062
Median : 0.000 Median : 0.000 TRUE :4316
Mean : 1.908 Mean : 4.009 NA's :0
3rd Qu.: 2.000 3rd Qu.: 3.000
Max. :73.000 Max. :99.000
I didn't find any missing values:
try(na.fail(credit))
dput(head(credit,4))
structure(list(ListingCategory = c(1L, 7L, 3L, 1L), IncomeRange = c(3L,
4L, 6L, 4L), StatedMonthlyIncome = c(2583.3333, 4326, 10500,
4166.6667), IncomeVerifiable = c(TRUE, TRUE, TRUE, FALSE), DTIwProsperLoan = c(0.18,
0.2, 0.17, 1e+06), EmploymentStatusDescription = structure(c(1L,
4L, 1L, 7L), .Label = c("Employed", "Full-time", "Not employed",
"Other", "Part-time", "Retired", "Self-employed"), class = "factor"),
MonthsEmployed = c(4L, 44L, 159L, 67L), BorrowerState = structure(c(22L,
32L, 5L, 5L), .Label = c("AK", "AL", "AR", "AZ", "CA", "CO",
"CT", "DC", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "KS",
"KY", "LA", "MA", "MD", "MI", "MN", "MO", "MS", "MT", "NC",
"NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA",
"RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI",
"WV", "WY"), class = "factor"), LenderIndicator = c(0L, 0L,
0L, 1L), GroupIndicator = c(FALSE, FALSE, FALSE, TRUE), ChannelCode = c(90000L,
90000L, 90000L, 80000L), MonthlyDebt = c(247L, 785L, 1631L,
817L), CurrentDelinquencies = c(0L, 0L, 0L, 0L), DelinquenciesLast7Years = c(0L,
10L, 0L, 0L), PublicRecordsLast10Years = c(0L, 1L, 0L, 0L
), PublicRecordsLast12Months = c(0L, 0L, 0L, 0L), CreditLinesLast7Years = c(53L,
30L, 36L, 26L), InquiriesLast6Months = c(2L, 8L, 5L, 0L),
AmountDelinquent = c(0L, 0L, 0L, 0L), CurrentCreditLines = c(10L,
10L, 18L, 10L), OpenCreditLines = c(9L, 10L, 15L, 8L), BankcardUtilization = c(0.26,
0.69, 0.94, 0.69), TotalOpenRevolvingAccounts = c(9L, 7L,
12L, 10L), InstallmentBalance = c(48648L, 14827L, 0L, 0L),
RealEstateBalance = c(0L, 0L, 577745L, 0L), RevolvingBalance = c(5265L,
9967L, 94966L, 50511L), RealEstatePayment = c(0L, 0L, 4159L,
0L), RevolvingAvailablePercent = c(78L, 52L, 36L, 45L), TotalInquiries = c(8L,
11L, 15L, 2L), TotalTradeItems = c(53L, 30L, 36L, 26L), SatisfactoryAccounts = c(52L,
23L, 36L, 26L), NowDelinquentDerog = c(0L, 0L, 0L, 0L), WasDelinquentDerog = c(1L,
7L, 0L, 0L), OldestTradeOpenDate = c(5092001L, 5011977L,
12011984L, 4272000L), DelinquenciesOver30Days = c(0L, 6L,
0L, 0L), DelinquenciesOver60Days = c(0L, 4L, 0L, 0L), DelinquenciesOver90Days = c(0L,
10L, 0L, 0L), IsHomeowner = c(FALSE, FALSE, TRUE, FALSE),
LoanStatus = structure(c(2L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor")), .Names = c("ListingCategory", "IncomeRange",
"StatedMonthlyIncome", "IncomeVerifiable", "DTIwProsperLoan",
"EmploymentStatusDescription", "MonthsEmployed", "BorrowerState",
"LenderIndicator", "GroupIndicator", "ChannelCode", "MonthlyDebt",
"CurrentDelinquencies", "DelinquenciesLast7Years", "PublicRecordsLast10Years",
"PublicRecordsLast12Months", "CreditLinesLast7Years", "InquiriesLast6Months",
"AmountDelinquent", "CurrentCreditLines", "OpenCreditLines",
"BankcardUtilization", "TotalOpenRevolvingAccounts", "InstallmentBalance",
"RealEstateBalance", "RevolvingBalance", "RealEstatePayment",
"RevolvingAvailablePercent", "TotalInquiries", "TotalTradeItems",
"SatisfactoryAccounts", "NowDelinquentDerog", "WasDelinquentDerog",
"OldestTradeOpenDate", "DelinquenciesOver30Days", "DelinquenciesOver60Days",
"DelinquenciesOver90Days", "IsHomeowner", "LoanStatus"), row.names = c(NA,
4L), class = "data.frame")
Any ideas on what's wrong?
Warning message:
In train.default(x, y, weights = w, ...): The metric "Accuracy" was not in the result set. ROC will be used instead.
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3540.667624
iter 20 value 3329.692768
iter 30 value 3279.191024
iter 40 value 3264.926986
iter 50 value 3259.276647
iter 60 value 3259.056261
final value 3259.032668
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3540.774666
iter 20 value 3330.016829
iter 30 value 3279.545595
iter 40 value 3265.384385
iter 50 value 3259.499032
iter 60 value 3259.353010
final value 3259.342601
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3540.667731
iter 20 value 3329.693092
iter 30 value 3279.191379
iter 40 value 3264.927427
iter 50 value 3259.276899
iter 60 value 3259.056561
final value 3259.032978
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3528.401458
iter 20 value 3314.932958
iter 30 value 3264.117072
iter 40 value 3253.780051
iter 50 value 3253.368959
iter 60 value 3253.359047
final value 3253.358819
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3528.508505
iter 20 value 3315.134599
iter 30 value 3265.021404
iter 40 value 3255.739021
iter 50 value 3253.817833
iter 60 value 3253.697180
final value 3253.671003
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3528.401565
iter 20 value 3314.933160
iter 30 value 3264.117768
iter 40 value 3253.780539
iter 50 value 3253.369030
iter 60 value 3253.359358
final value 3253.359133
converged
# weights: 71 (70 variable)
initial value 5145.231521
iter 10 value 4680.326236
iter 20 value 4672.506024
iter 30 value 3662.998233
iter 40 value 3310.207744
iter 50 value 3252.983656
iter 60 value 3250.400275
iter 70 value 3250.339216
final value 3250.332646
converged
... # weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 4661.569290
iter 20 value 4652.246624
iter 30 value 3715.472355
iter 40 value 3484.096833
iter 50 value 3254.247424
iter 60 value 3248.931841
iter 70 value 3248.154679
iter 80 value 3248.129089
iter 80 value 3248.129085
final value 3248.128574
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 4663.660886
iter 20 value 4654.255466
iter 30 value 3542.473235
iter 40 value 3315.027437
iter 50 value 3250.340679
iter 60 value 3248.693378
iter 70 value 3248.455840
iter 80 value 3248.443345
iter 80 value 3248.443325
iter 80 value 3248.443325
final value 3248.443325
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 4661.571382
iter 20 value 4652.248711
iter 30 value 4397.069608
iter 40 value 3532.067046
iter 50 value 3283.179445
iter 60 value 3249.518694
iter 70 value 3248.163057
iter 80 value 3248.129552
final value 3248.128889
converged
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.01805 Min. :0.9946
1st Qu.: NA 1st Qu.:0.01805 1st Qu.:0.9946
Median : NA Median :0.01805 Median :0.9946
Mean :NaN Mean :0.01805 Mean :0.9946
3rd Qu.: NA 3rd Qu.:0.01805 3rd Qu.:0.9946
Max. : NA Max. :0.01805 Max. :0.9946
NA's :3
Error in train.default(x, y, weights = w, ...): Stopping
summaryFunction = twoClassSummary appears to trigger the warning. It happens here as well:
ctrl <- trainControl(method = "cv", summaryFunction = twoClassSummary)
multinomSummaryFit <- train(LoanStatus~., credit, method = "multinom", family=binomial,
trControl = ctrl)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.01919 Min. :0.9941
1st Qu.: NA 1st Qu.:0.01988 1st Qu.:0.9942
Median : NA Median :0.02056 Median :0.9943
Mean :NaN Mean :0.02011 Mean :0.9943
3rd Qu.: NA 3rd Qu.:0.02056 3rd Qu.:0.9943
Max. : NA Max. :0.02057 Max. :0.9944
NA's :3
Error in train.default(x, y, weights = w, ...): Stopping
Looking at the output of summary(credit), I can see that there are NA values in at least two variables.
The variable MonthsEmployed has 5 NA values:
MonthsEmployed
Min. :-23.00
1st Qu.: 26.00
Median : 68.00
Mean : 97.44
3rd Qu.:139.00
Max. :755.00
NA's :5
and the variable InstallmentBalance has 328 NA values:
InstallmentBalance
Min. : 0
1st Qu.: 3338
Median : 14453
Mean : 24900
3rd Qu.: 32238
Max. :739371
NA's :328
Try removing the rows with missing values (or temporarily removing these two variables) and run the function again to see if this solves your problem.
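A minimal base R sketch of that check and both clean-up options. The toy data frame and its values are invented for illustration; only the column names come from the question, so substitute credit for toy on the real data.

```r
# Toy stand-in for `credit` (values are invented for illustration).
toy <- data.frame(
  MonthsEmployed     = c(4L, NA, 159L, 67L, 26L),
  InstallmentBalance = c(48648L, 14827L, NA, 0L, NA),
  LoanStatus         = factor(c("1", "0", "0", "1", "1"))
)

# Count missing values per column and show only the affected columns.
na_counts <- colSums(is.na(toy))
print(na_counts[na_counts > 0])

# Option 1: drop every row that has at least one NA.
toy_complete <- toy[complete.cases(toy), ]

# Option 2: temporarily drop the columns that contain NAs.
toy_no_na_cols <- toy[, na_counts == 0, drop = FALSE]
```

Option 1 keeps all predictors but loses rows; option 2 keeps all rows but loses predictors. Which one is preferable depends on how many rows are affected.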
Also, you need to add metric = "ROC" to the train function and classProbs = TRUE to trainControl() when you use twoClassSummary:
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
So, your call should be:
multinomSummaryFit <- train(LoanStatus~.,
data = credit,
method = "multinom",
family=binomial,
metric = "ROC",
trControl = ctrl)
Another important issue with your dataset: you need to carefully inspect the variables' values and make sure that each value makes sense. For example, the MonthsEmployed variable has negative values. Logically, an employee has a non-negative number of months employed. Are these negative values errors, or do they mean something else (for example, a value of -23 could mean that the person has not been employed for 23 months)?
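A quick way to find such values, sketched in base R. The vector below is invented for illustration; on the real data you would use credit$MonthsEmployed instead.

```r
# Invented example values; replace with credit$MonthsEmployed.
months_employed <- c(4L, 44L, -23L, 67L, NA, 0L)

# How many values are negative? NAs are ignored in the count.
n_negative <- sum(months_employed < 0, na.rm = TRUE)

# Which rows hold them? which() drops NA comparisons by itself.
negative_rows <- which(months_employed < 0)

print(n_negative)
print(negative_rows)
```

Whether to drop, recode, or keep those rows depends on what the negative values mean in the data source, which this check alone cannot tell you.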
To answer your question regarding confusionMatrix:
Let's say your trained model is called multinomSummaryFit. In order to evaluate your model on the test dataset, you need to call the predict method on the test dataset without LoanStatus (using the same variables you trained your model on), and then compare your model's predictions to the actual values in LoanStatus. For example:
# let's say your test data frame is called test
mymodel_pred <- predict(multinomSummaryFit, test[, names(test) != "LoanStatus"])
Then use confusionMatrix:
confusionMatrix(data = mymodel_pred,
reference = test$LoanStatus,
positive = "Default")  # "Default" stands for whichever LoanStatus level you treat as the positive class
If the test dataset does not have the LoanStatus column, then you just use:
mymodel_pred <- predict(multinomSummaryFit, test)
But in this case, you have no way to evaluate your model on the test dataset, because you do not know the actual response.
Remember, if you removed any variables from the training dataset, you also need to remove them from the test dataset before you call predict.
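One way to keep the two datasets aligned is to derive the predictor list from the training data and apply it to the test data. This is a base R sketch; train_toy, test_toy and their columns are hypothetical stand-ins for your own train and test data frames.

```r
# Hypothetical training and test sets; the test set has an extra column `c`
# that was removed from the training data.
train_toy <- data.frame(a = 1:3, b = 4:6,
                        LoanStatus = factor(c("A", "B", "A")))
test_toy  <- data.frame(a = 7:8, b = 9:10, c = 11:12,
                        LoanStatus = factor(c("B", "A")))

# Predictors are all training columns except the response.
predictors <- setdiff(names(train_toy), "LoanStatus")

# Sanity check: every training predictor must exist in the test set.
missing_in_test <- setdiff(predictors, names(test_toy))
stopifnot(length(missing_in_test) == 0)

# Keep only the predictors the model was trained on.
test_x <- test_toy[, predictors, drop = FALSE]
print(names(test_x))
```

The resulting test_x can then be passed to predict in place of the manual column filtering shown earlier.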
Splitting the data into training and test sets using stratified sampling:
trainingRows <- createDataPartition(credit$LoanStatus, p = .70, list= FALSE)
train <- credit[trainingRows, ]
test <- credit[-trainingRows, ]
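Try changing the class levels from "0" and "1" to valid R variable names, e.g. "A" and "B", and then run train again. With classProbs = TRUE, caret uses the level names as column names for the class probabilities, so they must be valid R names.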
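A minimal base R sketch of the relabelling. The vector is invented for illustration, and the labels "A" and "B" are placeholders; pick names that describe what 0 and 1 mean in your data.

```r
# Invented example; replace with credit$LoanStatus.
loan_status <- factor(c("1", "0", "0", "1"))

# "0" and "1" are not valid R variable names: make.names() has to alter them.
print(make.names(levels(loan_status)))

# Relabel explicitly so that "0" becomes "A" and "1" becomes "B".
relabelled <- factor(loan_status, levels = c("0", "1"), labels = c("A", "B"))

# The new levels pass through make.names() unchanged, i.e. they are valid names.
levels_are_valid <- all(levels(relabelled) == make.names(levels(relabelled)))
print(levels_are_valid)
```

Mapping levels and labels explicitly, rather than relying on the existing level order, makes it clear which original value ends up under which new name.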