How does glmnet's standardize argument handle dummy variables?

Tags: r, machine-learning, dataset, glmnet

In my dataset I have a number of continuous and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.

I currently do this manually by first defining a dummy vector of columns that have only values of [0,1] and then using the scale command on all the non-dummy columns. Problem is, this isn't very elegant.
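The manual approach described above might look roughly like the following. This is only a sketch on made-up data: the column names and the `is_dummy` helper vector are illustrative assumptions, not part of the original question.

```r
# Toy design matrix: two continuous columns and one 0/1 dummy (made-up data).
set.seed(1)
x <- cbind(
  age    = rnorm(10, mean = 40, sd = 10),
  income = rnorm(10, mean = 50000, sd = 8000),
  female = rep(c(0, 1), times = 5)
)

# Treat a column as a dummy if every value in it is 0 or 1.
is_dummy <- apply(x, 2, function(col) all(col %in% c(0, 1)))

# Centre and scale only the non-dummy columns; dummies are left untouched.
x_scaled <- x
x_scaled[, !is_dummy] <- scale(x[, !is_dummy, drop = FALSE])
```

A matrix prepared this way would presumably be passed to glmnet with `standardize = FALSE`, so that the columns are not rescaled a second time.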

But glmnet has a built-in standardize argument. By default, will this standardize the dummies too? If so, is there an elegant way to tell glmnet's standardize argument to skip dummies?

554

asked Jul 26 '13 17:07

Dr. Beeblebrox

2 Answers

In short, yes: this will standardize the dummy variables, but there's a reason for doing so. The glmnet function takes a matrix, not a data frame, as its x argument, so it cannot distinguish factor columns the way it could if the input were a data.frame. If you take a look at the R function, glmnet codes the standardize parameter internally as

    isd = as.integer(standardize)

This converts the R logical to a 0 or 1 integer that is passed to the internal FORTRAN routines (elnet, lognet, et al.).

If you go even further by examining the FORTRAN code (fixed width - old school!), you'll see the following block:

          subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr)    989
          real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni)                        989
          integer ju(ni)                                                        990
          real, dimension (:), allocatable :: v                                     
          allocate(v(1:no),stat=jerr)                                           993
          if(jerr.ne.0) return                                                  994
          w=w/sum(w)                                                            994
          v=sqrt(w)                                                             995
          if(intr .ne. 0)goto 10651                                             995
          ym=0.0                                                                995
          y=v*y                                                                 996
          ys=sqrt(dot_product(y,y)-dot_product(v,y)**2)                         996
          y=y/ys                                                                997
    10660 do 10661 j=1,ni                                                       997
          if(ju(j).eq.0)goto 10661                                              997
          xm(j)=0.0                                                             997
          x(:,j)=v*x(:,j)                                                       998
          xv(j)=dot_product(x(:,j),x(:,j))                                      999
          if(isd .eq. 0)goto 10681                                              999
          xbq=dot_product(v,x(:,j))**2                                          999
          vc=xv(j)-xbq                                                         1000
          xs(j)=sqrt(vc)                                                       1000
          x(:,j)=x(:,j)/xs(j)                                                  1000
          xv(j)=1.0+xbq/vc                                                     1001
          goto 10691                                                           1002

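Take a look at the lines marked 1000: they compute each column's weighted standard deviation, xs(j), and divide the column by it. This is basically the standardization formula applied to the X matrix, column by column, with no check for whether a column is a 0/1 dummy.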
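To make the arithmetic concrete, here is a rough base-R transcription of the quoted lines for a single column. The function name `standardize_column` and the example data are hypothetical, and the sketch only mirrors the excerpt shown (the branch reached when `intr` is 0, where the column is scaled but not centred); it is not glmnet's actual R interface.

```r
# Mirror of the quoted Fortran for one column x_j with observation weights w.
standardize_column <- function(x_j, w = rep(1, length(x_j))) {
  w   <- w / sum(w)        # w=w/sum(w): weights now sum to 1
  v   <- sqrt(w)           # v=sqrt(w)
  xw  <- v * x_j           # x(:,j)=v*x(:,j)
  xv  <- sum(xw * xw)      # weighted mean of x^2
  xbq <- sum(v * xw)^2     # squared weighted mean of x
  vc  <- xv - xbq          # weighted variance (1/n form for equal weights)
  xs  <- sqrt(vc)          # scale factor applied to the column
  list(xs = xs, x = xw / xs)
}

# Made-up 0/1 dummy column with 30% ones.
dummy <- c(0, 0, 0, 1, 1, 0, 1, 0, 0, 0)
p     <- mean(dummy)
xs    <- standardize_column(dummy)$xs

# For a dummy with proportion p of ones, the scale factor is sqrt(p * (1 - p)).
all.equal(xs, sqrt(p * (1 - p)))
```

Because the weights are normalized to sum to 1, the variance here uses a 1/n denominator, so `xs` differs from what `sd()` or `scale()` would give by a factor of `sqrt((n - 1) / n)`.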

Now, statistically speaking, one generally does not standardize categorical variables, in order to retain the interpretability of the estimated coefficients. However, as pointed out by Tibshirani, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables". So while this causes arbitrary scaling between continuous and categorical variables, it is done so that all regressors are penalized equally.

161

answered Nov 15 '22 20:11

R_User

glmnet doesn't know anything about dummy variables, because it doesn't have a formula interface (and hence doesn't touch model.frame and model.matrix). If you want them to be treated specially, you'll have to do it yourself.
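One way "doing it yourself" could look is to build the design matrix with model.matrix, scale only the continuous column, and switch off glmnet's own standardization. This is a minimal sketch on a made-up data frame; the variable names `dose` and `group` are assumptions for illustration only.

```r
# Made-up data frame with one continuous predictor and one factor.
df <- data.frame(
  y     = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8, 0.9, 3.3),
  dose  = c(10, 20, 30, 40, 50, 60, 70, 80),
  group = factor(c("a", "b", "c", "a", "b", "c", "a", "b"))
)

# Expand the factor into dummy columns by hand; drop the intercept column,
# since glmnet fits its own intercept by default.
x <- model.matrix(y ~ dose + group, data = df)[, -1]

# Scale only the continuous column, leaving the dummies as 0/1.
x[, "dose"] <- scale(x[, "dose"])

# Fit without glmnet's internal standardization (skipped if glmnet is absent).
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::glmnet(x, df$y, standardize = FALSE)
}
```

Note that with `standardize = FALSE` the unscaled dummies are penalized on their raw 0/1 scale, which is exactly the unequal-penalization trade-off discussed in the other answer.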

answered Nov 15 '22 20:11

Hong Ooi
