I recently needed to combine two or more variables on some data set to evaluate if their combination could enhance predictivity, thus I made some logistic regression in R. Now, on the statistic Q&A, someone suggested that I may use the linear discriminant analysis.
Since I don't have any fitcdiscr.m
in MATLAB, I'd rather go with lda in R but I cannot use the fit results to predict AUC or whatever I could use. Indeed, I see that fit output vector of lda in R is some sort of vector with multiple classes and I guess I should use fit$posterior
to predict Cases against Controls, but I cannot take those data out of it.
For further information, I get this results as fit$posterior
:
$posterior
0 1
1 0.7707927 0.22920726
2 0.7085165 0.29148352
3 0.6990989 0.30090106
4 0.5902161 0.40978387
5 0.8667109 0.13328912
6 0.6924406 0.30755939
7 0.7471086 0.25289141
8 0.7519326 0.24806736
And so on up to the last observation which is 242. Every time I try to take, for example, column 1 by fit$posterior[,1]
, I get:
1 2 3 4 5 6 7 8
0.7707927 0.7085165 0.6990989 0.5902161 0.8667109 0.6924406 0.7471086 0.7519326
9 10 11 12 13 14 15 16
0.7519326 0.6902850 0.7519326 0.8080445 0.8075360 0.8484318 0.4860899 0.8694121
I don't know which part of the code could be useful, since I made very basic computation:
library(gdata)
data=read.xls("ECGvarious.xls", perl="C:/Strawberry/perl/bin/perl.exe");
i=6;
p=19;
temp=data[,i];
temp1=data[, p];
library(MASS)
fit <- lda(Case ~ temp + temp , data=data, na.action="na.omit", CV=TRUE)
I can't link the data, anyway ECGvarious is simply an N observation x P variables, being N= N1+ N2 with N1 the number of Controls and N2 the number of Cases, and the Cases are defined as subjects who developed pathology after a follow up. The very last column of data is just 0 or 1 for Controls and Cases, respectively.
When I performed the logistic regression, I did:
mod1<-glm(Case ~ temp + temp1, data=data, family="binomial");
auctemp=auc(Case~predict(mod1), data=data);
Here's my input concerning logistic regression and prediction (I don't know much about linear discrimination but understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor if this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and addresses at least in part some of your questions. If not, and if other users think this would better belong on Cross Validated, I won't take offense. :)
We'll first generate 200 observations, having increasing levels of probability for Case=1. The first predictor (pred1
) will follow a distribution that is nonlinear, close to the one being modeled when doing logistic regression. It will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.
set.seed(2351)
df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE),
sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE),
sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
pred2 = runif(n = 200, min = 0, max = 100))
We see in the boxplot below that the observations where case==1
generally have higher pred1
, which is intended (from the way we generated the data). At the same time, there is an overlap, otherwise it would make it too easy to decide on a cutoff point/threshold.
boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")
First using both predictors:
model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
summary(model.1)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.058258 0.479094 -4.296 1.74e-05 ***
# pred1 0.428491 0.075373 5.685 1.31e-08 ***
# pred2 0.003399 0.005500 0.618 0.537
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.51 on 197 degrees of freedom
# AIC: 244.51
As we'd expect, the first predictor is rather strongly related, and the second, poorly related to the outcome.
Note that to get Odds Ratios from those coefficients, we need to exponentiate them:
exp(model.1$coefficients[2:3])
# pred1 pred2
# 1.534939 1.003405 # Odds Ratios (making the relationships appear more clearly).
# Use `exp(confint(model.1))` to get confidence intervals.
We'll compare this model to a simpler model, removing the second predictor:
model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
summary(model.2)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.87794 0.37452 -5.014 5.32e-07 ***
# pred1 0.42651 0.07514 5.676 1.38e-08 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.89 on 198 degrees of freedom
# AIC: 242.89
exp(model.2$coefficients)[2]
# pred1
# 1.531907 # Odds Ratio
We could also run an anova(model.1, model.2)
, but let's skip this part and move on to prediction, keeping this simpler model as the second variable doesn't add much predictive value, if any. In practive, having more predictors is rarely a problem unless it's truly random noise, but here I focus more on the operation of predicting and choosing a proper threshold.
In the model.2
object (a list), there is an item named fitted.values
. Those values are the exact same that we'd get from predict(model.2, type="response")
and can be interpreted as probabilities; one for each row, based on the predictor(s) and their coefficient(s).
It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.
With model.1
(2 predictors):
predict(model.1, newdata = list(pred1=1, pred2=42), type="response")
# 1
# 0.1843701
With model.2
(1 predictor):
predict(model.2, newdata = list(pred1=12), type="response")
# 1
# 0.96232
Looking back at the link between our predictor pred1
and the calculated probability of having Case=1
:
plot(df$pred1, model.2$fitted.values,
xlab="pred1", ylab="probability that Case=1")
We note that since we have only one predictor, the probability is a direct function of it. If we had kept the other predictor in the equation, we'd see points grouped around the same line, but in a cloud of points.
But this doesn't change the fact that if we are to evaluate how well our model can predict binary outcomes, we need to settle on a threshold above which we'll consider that the observation is a Case. Several packages have tools to help picking that threshold. But even without any additional package, we can calculate various properties over a range of thresholds using a function such as the following, which will calculate the sensitivity (ability to detect True Cases), specificity (ability to identify True Non Cases), and other properties well described here.
df.ana <- data.frame(thresh=seq(from = 0, to = 100, by = 0.5) / 100)
for(i in seq_along(df.ana$thresh)) {
df.ana$sensitivity[i] <- sum(df$Case==1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(df$Case==1)
df.ana$specificity[i] <- sum(df$Case==0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(df$Case==0)
df.ana$pos.pred.value[i] <- sum(df$Case == 1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(predict(model.2, type="resp") >= df.ana$thresh[i])
df.ana$neg.pred.value[i] <- sum(df$Case == 0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(predict(model.2, type="resp") < df.ana$thresh[i])
df.ana$accuracy[i] <- sum((predict(model.2, type="resp") >= df.ana$thresh[i]) == df$Case) / nrow(df)
}
which.max(df.ana$accuracy)
# [1] 46
optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46
The accuracy is the proportion of correct predictions over all predictions. The 46th threshold (0.46) is the "best" for that matter. Let's check a few other neighboring rows in the generated dataframe; it tells us that 0.47 would work as well on all fronts. Fine-tuning would involve adding some new data to our initial dataframe.
df.ana[45:48,]
# thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
# 45 0.45 0.7142857 0.6947368 0.7211538 0.6875000 0.705
# 46 0.46 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 47 0.47 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 48 0.48 0.7047619 0.7157895 0.7326733 0.6868687 0.710
Note that the auc
function (area under the curve) will give the same number as the accuracy for that threshold:
library(pROC)
auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)
# Area under the curve: 0.715
# thresholds against accuracy
plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")
# thresholds against Sensitivity
lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l",col="blue") # Sensitivity We want to maximize this, but not too much
text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")
# thresholds against specificity
lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity we want to maximize also, but not too much
text(x = 0.1, y = 0.05, labels = "Specificity", col="red")
# optimal threshold vertical line
abline(v=optimal.thresh)
text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)
Incidentally, all lines converge more or less to the same point, which suggests this is a good compromise between all the qualities we look for in a predictive tool. But depending on your objectives, it might be better picking a lower or a higher threshold. Statistical tools are useful, but in the end, some other considerations are often more important in making a final decision.
The following graph is the same as the one which would be produced with pROC's roc:
plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity")
# Equivalent to
# plot(roc(predictor=model.2$fitted.values, response = model.2$y))
The following function allows one to calculate, for a logistic model fit, the same stats seen above, and gives a 2x2 table for any chosen threshold.
diagnos.test <- function(model, threshold) {
output <- list()
output$stats <- c(
sensitivity = sum(model.1$y==1 & (predict(model, type="resp") >= threshold)) / sum(model.1$y==1),
specificity = sum(model.1$y==0 & (predict(model, type="resp") < threshold)) / sum(model.1$y==0),
pos.pr.value = sum(model.1$y==1 & (predict(model.2, type="resp") >= threshold)) / sum(predict(model.2, type="resp") >= threshold),
neg.pr.value = sum(df$Case == 0 & (predict(model.2, type="resp") < threshold)) / sum(predict(model.2, type="resp") < threshold),
accuracy = sum((predict(model.2, type="resp") >= threshold) == df$Case) / nrow(df))
output$tab <- addmargins(t(table(model$y, as.numeric(predict(model, type="response") > threshold),dnn = list("Cases", "Predictions")))[2:1,2:1])
return(output)
}
diagnos.test(model.2, 0.47)
# $stats
# sensitivity specificity pos.pr.value neg.pr.value accuracy
# 0.7142857 0.7157895 0.7352941 0.6938776 0.7150000
#
# $tab
# Cases
# Predictions 1 0 Sum
# 1 75 27 102
# 0 30 68 98
# Sum 105 95 200
Final note
I don't pretend I have covered everything on prediction, sensitivity and specificity; my goal was more to go as far as possible using common language and calculations, not relying on any specific packages.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With