In my dataset I have a number of continuous and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.
I currently do this manually by first defining a vector of the dummy columns (those that take only the values 0 and 1) and then using the scale command on all the non-dummy columns. The problem is, this isn't very elegant.
But glmnet has a built-in standardize argument. By default, will this standardize the dummies too? If so, is there an elegant way to tell glmnet's standardize argument to skip dummies?
For example, many people don't like to standardize dummy variables, which only have values of 0 and 1, because a “one standard deviation increase” isn't something that could actually happen with such a variable. Ergo, you might want to leave the dummy variables unstandardized while standardizing continuous X variables.
In statistics and econometrics, particularly in regression analysis, a dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
Normalization/standardization of features is done to bring all features to a similar scale. When you one-hot encode categorical variables, the resulting columns take only the values 0 and 1, so there is no large scale difference (such as 10 to 1000) and hence no need to apply normalization/standardization techniques.
In short, yes: this will standardize the dummy variables, but there's a reason for doing so. The glmnet function takes a matrix as input for its x parameter, not a data frame, so it cannot distinguish factor columns the way it could if the parameter were a data.frame. If you take a look at the R function, glmnet codes the standardize parameter internally as
isd = as.integer(standardize)
This converts the R logical to a 0 or 1 integer that is passed to the internal FORTRAN routines (elnet, lognet, et al.).
If you go even further by examining the FORTRAN code (fixed width - old school!), you'll see the following block:
subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr) 989
real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni) 989
integer ju(ni) 990
real, dimension (:), allocatable :: v
allocate(v(1:no),stat=jerr) 993
if(jerr.ne.0) return 994
w=w/sum(w) 994
v=sqrt(w) 995
if(intr .ne. 0)goto 10651 995
ym=0.0 995
y=v*y 996
ys=sqrt(dot_product(y,y)-dot_product(v,y)**2) 996
y=y/ys 997
10660 do 10661 j=1,ni 997
if(ju(j).eq.0)goto 10661 997
xm(j)=0.0 997
x(:,j)=v*x(:,j) 998
xv(j)=dot_product(x(:,j),x(:,j)) 999
if(isd .eq. 0)goto 10681 999
xbq=dot_product(v,x(:,j))**2 999
vc=xv(j)-xbq 1000
xs(j)=sqrt(vc) 1000
x(:,j)=x(:,j)/xs(j) 1000
xv(j)=1.0+xbq/vc 1001
goto 10691 1002
Take a look at the lines marked 1000: this is basically applying the standardization formula to the X matrix.
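To make the scale factor in that excerpt concrete, here is a minimal base-R sketch of what the lines marked 998 to 1000 compute for xs(j). The simulated data and the equal weights are my own assumptions for illustration, not something taken from glmnet itself.

```r
# Base-R sketch of the scale factor xs(j) from the Fortran excerpt.
# Assumption: weights are normalised to sum to 1, as in `w=w/sum(w)`.
set.seed(1)
n <- 50
x <- cbind(cont  = rnorm(n, mean = 10, sd = 3),  # a continuous column
           dummy = rbinom(n, 1, 0.3))            # a 0/1 dummy column
w <- rep(1 / n, n)                               # equal observation weights
v <- sqrt(w)

xs <- apply(x, 2, function(col) {
  vx  <- v * col          # x(:,j) = v * x(:,j)
  xv  <- sum(vx * vx)     # xv(j): weighted second moment
  xbq <- sum(v * vx)^2    # xbq: squared weighted mean
  sqrt(xv - xbq)          # xs(j) = sqrt(vc)
})

# With equal weights this is the "divide by n" standard deviation,
# so it differs from R's sd() by a factor of sqrt((n - 1) / n).
sd_n <- apply(x, 2, sd) * sqrt((n - 1) / n)
print(cbind(xs, sd_n))
```

Both columns get the same treatment here: nothing in the calculation checks whether a column holds only 0s and 1s.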
Statistically speaking, one does not generally standardize categorical variables, so that the estimated coefficients remain interpretable. However, as Tibshirani points out, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables". So while this causes arbitrary scaling between continuous and categorical variables, it is done so that all regressors are penalized equally.
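As a rough illustration of that scaling, the "divide by n" standard deviation of a 0/1 column with a proportion p of ones is sqrt(p * (1 - p)), so how far a dummy is stretched depends only on how common the category is. The proportions and the small vector below are made up for the example.

```r
# Standard deviation of a 0/1 dummy as a function of its prevalence p.
p <- c(0.5, 0.1, 0.01)
print(data.frame(p = p,
                 sd = sqrt(p * (1 - p)),
                 step_after_scaling = 1 / sqrt(p * (1 - p))))

# Check against an actual dummy column with 10 ones out of 100 rows.
d <- rep(c(1, 0), times = c(10, 90))
sd_d <- sqrt(mean(d^2) - mean(d)^2)   # "divide by n" standard deviation
print(sd_d)
```

The last column is the size of the 0-to-1 jump once the dummy has been divided by its standard deviation: the rarer the category, the larger that jump.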
glmnet doesn't know anything about dummy variables, because it doesn't have a formula interface (and hence doesn't touch model.frame and model.matrix). If you want them to be treated specially, you'll have to do it yourself.
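One way to do it yourself is sketched below: scale only the columns that are not 0/1, then switch off glmnet's own standardization. The helper scale_non_dummies and the simulated data are hypothetical names of mine, not part of glmnet, and the glmnet call is skipped if the package isn't installed.

```r
# Hypothetical helper (not part of glmnet): centre and scale every column
# except those that contain only 0s and 1s.
scale_non_dummies <- function(x) {
  x <- as.matrix(x)
  is_dummy <- apply(x, 2, function(col) all(col %in% c(0, 1)))
  if (any(!is_dummy)) {
    x[, !is_dummy] <- scale(x[, !is_dummy, drop = FALSE])
  }
  x
}

# Simulated example data (assumption, for illustration only).
set.seed(42)
n <- 100
x <- cbind(age    = rnorm(n, 40, 10),
           income = rnorm(n, 50000, 15000),
           female = rbinom(n, 1, 0.5))
y <- rnorm(n)

x_scaled <- scale_non_dummies(x)

# Tell glmnet not to standardize again, so the dummies stay as 0/1.
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::glmnet(x_scaled, y, standardize = FALSE)
}
```

Note that scale() divides by the n - 1 standard deviation, so the result differs slightly from glmnet's internal scaling. Leaving the dummies unscaled also means the penalty no longer treats them on the same footing as the continuous columns.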