Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract all model statistics from rms fits?

Tags:

r

statistics

The rms package contains a wealth of useful statistical functions. However, I cannot find a proper way to extract certain fit statistics from the fitted object. Consider an example:

library(pacman)
p_load(rms, stringr, readr)

#fit
> (fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris))
Linear Regression Model

 rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length + 
     Petal.Width + Species, data = iris)

                Model Likelihood     Discrimination    
                   Ratio Test           Indexes        
 Obs     150    LR chi2    302.96    R2       0.867    
 sigma0.3068    d.f.            5    R2 adj   0.863    
 d.f.    144    Pr(> chi2) 0.0000    g        0.882    

 Residuals

       Min        1Q    Median        3Q       Max 
 -0.794236 -0.218743  0.008987  0.202546  0.731034 


                    Coef    S.E.   t     Pr(>|t|)
 Intercept           2.1713 0.2798  7.76 <0.0001 
 Sepal.Width         0.4959 0.0861  5.76 <0.0001 
 Petal.Length        0.8292 0.0685 12.10 <0.0001 
 Petal.Width        -0.3152 0.1512 -2.08 0.0389  
 Species=versicolor -0.7236 0.2402 -3.01 0.0031  
 Species=virginica  -1.0235 0.3337 -3.07 0.0026 

So, the print function for the fit prints a lot of useful stuff including standard errors and adjusted R2. Unfortunately, if we inspect the model fit object, the values don't seem to be present anywhere.

> str(fit)
List of 19
 $ coefficients     : Named num [1:6] 2.171 0.496 0.829 -0.315 -0.724 ...
  ..- attr(*, "names")= chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
 $ residuals        : Named num [1:150] 0.0952 0.1432 -0.0731 -0.2894 -0.0544 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ effects          : Named num [1:150] -71.5659 -1.1884 9.1884 -1.3724 -0.0587 ...
  ..- attr(*, "names")= chr [1:150] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
 $ rank             : int 6
 $ fitted.values    : Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ assign           :List of 4
  ..$ Sepal.Width : int 2
  ..$ Petal.Length: int 3
  ..$ Petal.Width : int 4
  ..$ Species     : int [1:2] 5 6
 $ qr               :List of 5
  ..$ qr   : num [1:150, 1:6] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
  ..$ qraux: num [1:6] 1.08 1.02 1.11 1.02 1.02 ...
  ..$ pivot: int [1:6] 1 2 3 4 5 6
  ..$ tol  : num 1e-07
  ..$ rank : int 6
  ..- attr(*, "class")= chr "qr"
 $ df.residual      : int 144
 $ var              : num [1:6, 1:6] 0.07828 -0.02258 -0.00198 0.01589 -0.02837 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
  .. ..$ : chr [1:6] "Intercept" "Sepal.Width" "Petal.Length" "Petal.Width" ...
 $ stats            : Named num [1:6] 150 302.964 5 0.867 0.882 ...
  ..- attr(*, "names")= chr [1:6] "n" "Model L.R." "d.f." "R2" ...
 $ linear.predictors: Named num [1:150] 5 4.76 4.77 4.89 5.05 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ call             : language rms::ols(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
 $ terms            :Classes 'terms', 'formula'  language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
  .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
  .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
  .. .. .. ..$ : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
  .. ..- attr(*, "term.labels")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
  .. ..- attr(*, "order")= int [1:4] 1 1 1 1
  .. ..- attr(*, "intercept")= num 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
  .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
  .. .. ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
  .. ..- attr(*, "formula")=Class 'formula'  language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
  .. .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
 $ Design           :List of 12
  ..$ name        : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
  ..$ label       : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
  ..$ units       : Named chr [1:4] "" "" "" ""
  .. ..- attr(*, "names")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
  ..$ colnames    : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Species=versicolor" ...
  ..$ mmcolnames  : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
  ..$ assume      : chr [1:4] "asis" "asis" "asis" "category"
  ..$ assume.code : int [1:4] 1 1 1 5
  ..$ parms       :List of 1
  .. ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
  ..$ limits      : list()
  ..$ values      : list()
  ..$ nonlinear   :List of 4
  .. ..$ Sepal.Width : logi FALSE
  .. ..$ Petal.Length: logi FALSE
  .. ..$ Petal.Width : logi FALSE
  .. ..$ Species     : logi [1:2] FALSE FALSE
  ..$ interactions: NULL
 $ non.slopes       : num 1
 $ na.action        : NULL
 $ scale.pred       : chr "Sepal.Length"
 $ fail             : logi FALSE
 $ sformula         :Class 'formula'  language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
 - attr(*, "class")= chr [1:3] "ols" "rms" "lm"

There is a 7 year old question on R help where the package creator explains a solution to getting these:

On Wed, 11 Aug 2010, david dav wrote:

Hi, I would like to extract the coefficients of a logistic regression (estimates and standard error as well) in lrm as in glm with

summary(fit.glm)$coef

Thanks David

coef(fit) sqrt(diag(vcov(fit)))

But these will not be very helpful except in the trivial case where everything is linear, nothing interacts, and factors have two levels.

Frank

And the solution is according to the author not optimal. This leaves one wondering how the displayed values are calculated. Tracing down the code results in a hunt through the undocumented package code (the package code is on Github). I.e. we begin with print.ols():

> rms:::print.ols
function (x, digits = 4, long = FALSE, coefs = TRUE, title = "Linear Regression Model", 
    ...) 
{
    latex <- prType() == "latex"
    k <- 0
    z <- list()
    if (length(zz <- x$na.action)) {
        k <- k + 1
        z[[k]] <- list(type = paste("naprint", class(zz)[1], 
            sep = "."), list(zz))
    }
    stats <- x$stats
...

Reading further we do find that e.g. R2 adj. is calculated in the print function:

rsqa <- 1 - (1 - r2) * (n - 1) / rdf

We also find some standard error calculations, though no p values.

  se <- sqrt(diag(x$var))
  z[[k]] <- list(type='coefmatrix',
                 list(coef    = x$coefficients,
                      se      = se,
                      errordf = rdf))

All the results are passed down further to prModFit(). We can look it up and find the p value calculation etc. Unfortunately, the print command returns NULL so these values are not available anywhere for programmatic reuse:

> x = print((fit = rms::ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)))
#printed output...
> x
NULL

How does one get all the statistics?

like image 757
CoderGuy123 Avatar asked Dec 09 '17 00:12

CoderGuy123


2 Answers

Here is a hack solution where we capture the output of the print command:

#parser
get_model_stats = function(x, precision=60) {

  # remember old number formatting function
  # (which would round and transforms p-values to formats like "<0.01")
  old_format_np = rms::formatNP
  # substitute it with a function which will print out as many digits as we want
  assignInNamespace("formatNP", function(x, ...) formatC(x, format="f", digits=precision), "rms")

  # remember old width setting
  old_width = options('width')$width
  # substitute it with a setting making sure the table will not wrap
  options(width=old_width + 4 * precision)

  # actually print the data and capture it
  cap = capture.output(print(x))

  # restore original settings
  options(width=old_width)
  assignInNamespace("formatNP", old_format_np, "rms")
  
  #model stats
  stats = c()
  stats$R2.adj = str_match(cap, "R2 adj\\s+ (\\d\\.\\d+)") %>% na.omit() %>% .[, 2] %>% as.numeric()
  
  #coef stats lines
  coef_lines = cap[which(str_detect(cap, "Coef\\s+S\\.E\\.")):(length(cap) - 1)]
  
  #parse
  coef_lines_table = suppressWarnings(readr::read_table(coef_lines %>% stringr::str_c(collapse = "\n")))
  colnames(coef_lines_table)[1] = "Predictor"
  
  list(
    stats = stats,
    coefs = coef_lines_table
  )
}

Example:

> get_model_stats(fit)
$stats
$stats$R2.adj
[1] 0.86


$coefs
# A tibble: 6 x 5
           Predictor  Coef  S.E.     t `Pr(>|t|)`
               <chr> <dbl> <dbl> <dbl>      <chr>
1          Intercept  2.17 0.280   7.8    <0.0001
2        Sepal.Width  0.50 0.086   5.8    <0.0001
3       Petal.Length  0.83 0.069  12.1    <0.0001
4        Petal.Width -0.32 0.151  -2.1     0.0389
5 Species=versicolor -0.72 0.240  -3.0     0.0031
6  Species=virginica -1.02 0.334  -3.1     0.0026

This still has issues, e.g. p values are not returned as numerics and only has 4 digits, which can cause issues in some situations. The updated code should extract digits up to arbitrary precision.

Be extra careful when using this with long variable names as those could wrap the table into multiple rows and introduce missing values (NA) in output even though the stats are in there!

like image 55
CoderGuy123 Avatar answered Nov 02 '22 19:11

CoderGuy123


Package broom is a great way to extract model info.

library(pacman)
library(rms)
library(broom)

fit = ols(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, 
               data = iris)

tidy(summary.lm(fit))

#                 term   estimate  std.error statistic      p.value
# 1          Intercept  2.1712663 0.27979415  7.760227 1.429502e-12
# 2        Sepal.Width  0.4958889 0.08606992  5.761466 4.867516e-08
# 3       Petal.Length  0.8292439 0.06852765 12.100867 1.073592e-23
# 4        Petal.Width -0.3151552 0.15119575 -2.084418 3.888826e-02
# 5 Species=versicolor -0.7235620 0.24016894 -3.012721 3.059634e-03
# 6  Species=virginica -1.0234978 0.33372630 -3.066878 2.584344e-03

glance(fit)

#   r.squared adj.r.squared     sigma statistic      p.value df    logLik      AIC      BIC df.residual
# 1 0.8673123      0.862705 0.3068261   188.251 2.666942e-61  6 -32.55801 79.11602 100.1905         144

The object fit also contains some easily accessible info that you can get and store in a dataframe:

fit$coefficients

# Intercept        Sepal.Width       Petal.Length        Petal.Width Species=versicolor  Species=virginica 
# 2.1712663          0.4958889          0.8292439         -0.3151552         -0.7235620         -1.0234978

fit$stats

#           n  Model L.R.        d.f.          R2           g       Sigma 
# 150.0000000 302.9635115   5.0000000   0.8673123   0.8820479   0.3068261
like image 1
AntoniosK Avatar answered Nov 02 '22 19:11

AntoniosK