I want to standardize the variables of a biological dataset. I need to run GLMs (glm), negative binomial models (glm.nb) and linear models (lm) using different response variables.
The dataset contains counts of a given tree species by plot (all plots have the same size) and a series of qualitative variables: vegetation type, soil type and presence/absence of cattle.
DATA
library(standardize)
library(AICcmodavg)
set.seed(1234)
# Short version of the dataset missing other response variables
dat <- data.frame(Plot_ID = 1:80,
                  Ct_tree = sample(x = 1:400, size = 80, replace = TRUE),
                  Veg = sample(x = c("Dry", "Wet", "Mixed"), size = 80, replace = TRUE),
                  Soil = sample(x = c("Clay", "Sandy", "Rocky"), size = 80, replace = TRUE),
                  Cattle = rep(x = c("Yes", "No"), each = 40))
PROBLEM
As all the explanatory variables are categorical, I'm not sure whether it is possible to produce standardized lm models with standardized coefficients and standardized standard errors.
If I try to standardize with base R's scale(), I get an error because the explanatory variables are not numeric. I am trying the standardize R package, but I am not sure whether it does what I need.
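For reference, a minimal base-R reproduction of that error (the factor levels below simply mirror the toy data above):

```r
# scale() only accepts numeric input, so a factor column fails:
f <- factor(c("Clay", "Sandy", "Rocky"))
msg <- tryCatch(scale(f), error = function(e) conditionMessage(e))
msg  # the message says that 'x' must be numeric
```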
MODELS
m1 <- standardize(formula = Ct_tree ~ 1, data = dat, family = "gaussian", scale = 1)
# Error in standardize(formula = Ct_tree ~ 1, data = dat, family = "gaussian": no variables in formula
m2 <- standardize(formula = Ct_tree ~ Veg, data = dat, family = "gaussian", scale = 1)
m3 <- standardize(formula = Ct_tree ~ Soil, data = dat, family = "gaussian", scale = 1)
m4 <- standardize(formula = Ct_tree ~ Cattle, data = dat, family = "gaussian", scale = 1)
m5 <- standardize(formula = Ct_tree ~ Veg + Soil, data = dat, family = "gaussian", scale = 1)
m6 <- standardize(formula = Ct_tree ~ Veg + Cattle, data = dat, family = "gaussian", scale = 1)
m7 <- standardize(formula = Ct_tree ~ Soil + Cattle, data = dat, family = "gaussian", scale = 1)
m8 <- standardize(formula = Ct_tree ~ Veg + Soil + Cattle, data = dat, family = "gaussian", scale = 1)
# m1_st <- lm(formula = m1$formula, data = m1$data) # cannot run: m1 errored above
m2_st <- lm(formula = m2$formula, data = m2$data)
# [...]
m8_st <- lm(formula = m8$formula, data = m8$data)
# Produce a summary table of AICs
models <- list(Veg = m2_st, Soil = m3_st, Cattle = m4_st,
               VegSoil = m5_st, VegCattle = m6_st,
               SoilCattle = m7_st, VegSoilCattle = m8_st)
aic_tbl <- aictab(models, second.ord = TRUE, sort = TRUE)
QUESTIONS
1) Am I implementing the standardize package correctly?
2) Is my code doing the standardization that I am after?
3) When I call mi$data, it looks like the response variable (Ct_tree) has been standardized. Is this what is supposed to happen? I thought the standardization would apply to the explanatory variables, not the response.
4) How can I standardize the intercept-only model (Ct_tree ~ 1)? Maybe it does not need to be standardized, but I still need it in the final AIC table to compare all the models.
5) I also have other response variables that are absence/presence (recoded as 0 and 1 respectively). Is it statistically correct to also standardize these columns using the same process as above? The standardize package produces a presence/absence column identical to the original. However, if I rescale such a column with base R's scale(), the resulting numbers are positive and negative decimals, and I cannot apply a binomial family.
6) If I recode the qualitative explanatory variables as ordinal (e.g. Soil = 0 for clay, 1 for sandy, 2 for rocky) and then scale them, would that be statistically correct?
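To illustrate the behaviour described in question 5, here is a small sketch of what base R's scale() does to a recoded 0/1 column (the vector is made up):

```r
pa <- c(0, 1, 1, 0, 1)          # hypothetical presence/absence column
pa_sc <- as.numeric(scale(pa))  # centred, then divided by the SD
pa_sc                           # a mix of negative and positive decimals
all(pa_sc %in% c(0, 1))         # FALSE: no longer valid for a binomial family
```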
Therefore, standardising (or any linear transformation of the variables examined, for that matter) will not change the correlation: any prior rescaling effect on the covariance is cancelled out by the scale normalisation that produces the final correlation estimate.
Centering the variables and standardizing them will both reduce multicollinearity. However, standardizing changes the interpretation of the coefficients.
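A quick base-R sketch of the centering point (toy data, not from the question): a predictor with a nonzero mean is strongly correlated with its own square, and centering removes most of that collinearity.

```r
set.seed(1)
x  <- rnorm(100, mean = 10)                # predictor with a nonzero mean
xc <- as.numeric(scale(x, scale = FALSE))  # centering only, no rescaling
cor(x, x^2)    # close to 1: x and x^2 are nearly collinear
cor(xc, xc^2)  # much closer to 0 after centering
```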
Logistic regression requires normalization as well, in order to avoid the vanishing-gradient problem during the training phase. For linear models, if you train a linear regression without prior normalization, you can't use the coefficients as indicators of feature importance.
My answer could be opinionated. In addition, I am a biologist and not a mathematician, so somebody with a better mathematical background could give a more reasoned answer.
The first question is: why do we need standardization? Basically, we use it to compare the effect sizes of different predictors. Let's say we want to estimate how plant mass (M) depends on the concentration of nitrogen in the soil (N) and the availability of water (W). There will be two predictors with different units and different amplitudes. It is important here that both predictors are continuous. We can estimate regression coefficients from the raw data. Let's assume that the final biomass can be represented as
M = 0.1 * N + 0.2 * W + error
So, which factor is more important? Of course, we cannot deduce that from the coefficients alone. For the comparison, we need to take into account the units and variability of the factors. Therefore, reporting coefficients alone could be insufficient for understanding your findings. Standardization can be a solution in such a situation.
Now let's assume that we got the same regression coefficients, but the predictors were standardized beforehand. In that situation it is clear that the plant mass changes by 0.1 when the nitrogen concentration changes by one standard deviation. The same goes for water (0.2 mass units per 1 SD). If your experiment covered wide ranges of water and nitrogen conditions, you can suggest that water is twice as important as nitrogen. So standardization is useful for comparing the effects of continuous predictors.
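That argument can be checked with a small simulation in base R (the means, SDs and sample size below are invented to match the M = 0.1·N + 0.2·W example):

```r
set.seed(42)
n <- 200
N <- rnorm(n, mean = 5,  sd = 2)    # nitrogen, one unit system
W <- rnorm(n, mean = 50, sd = 20)   # water, a much larger scale
M <- 0.1 * N + 0.2 * W + rnorm(n)   # plant mass with noise

coef(lm(M ~ N + W))                       # raw coefficients: ~0.1 and ~0.2
coef(lm(scale(M) ~ scale(N) + scale(W)))  # standardized: W clearly dominates
```

The raw coefficients recover the simulated 0.1 and 0.2 but say nothing about relative importance; only the standardized ones show how much larger W's contribution is.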
In your case, the predictors are categorical, i.e. factors. And your initial question is "does the number of trees differ between condition groups?". Here your result will be an amount of difference. For example, there are on average 50 more trees per plot on clay soil than on sandy soil. That is quite a clear result. If some condition is responsible for a larger change in the number of trees, it has a stronger effect. So it seems that standardization is not required.
Still, you can ask an additional question: "Is a difference of 50 trees a big difference?". If the average number of trees is 10,000, an additional 50 trees is negligible. But if there are on average 100 trees per plot, the change is really big. To deal with this, you can standardize your response variable. You will then get the difference in standard deviation units (something like Cohen's d).
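A base-R sketch of standardizing only the response with a categorical predictor (the group means, SD and sample sizes here are invented):

```r
set.seed(7)
soil  <- factor(rep(c("Clay", "Sandy"), each = 40))
trees <- rnorm(80, mean = ifelse(soil == "Clay", 150, 100), sd = 30)

coef(lm(trees ~ soil))         # 'soilSandy': difference in raw tree counts
coef(lm(scale(trees) ~ soil))  # same contrast, now in SD units (Cohen's-d-like)
```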
In any case, the choice to standardize or not should be your decision, based on your expertise in the field. If standardization helps you interpret your results, then do it. If you think that a difference in standard deviation units is more illustrative and will be more understandable for your readers, then do it. As for me, I'd suggest staying with raw values but presenting the results in relative units (%). For example, there are 15% more trees on clay than on sandy soils. But again, it should be your decision.
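The relative-units suggestion is just a ratio of group means; with made-up numbers:

```r
# Hypothetical group means (not computed from the question's data)
mean_clay  <- 115
mean_sandy <- 100
pct_diff <- 100 * (mean_clay - mean_sandy) / mean_sandy
pct_diff  # 15, i.e. 15% more trees on clay than on sandy soil
```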
As a conclusion: base::scale will be enough.