Using R's lm on a dataframe with a list of predictors

Tags: r

I have a data frame with, say, N+2 columns. The first holds dates (mainly used for plotting later on); the second is the response variable, which I would like to regress on the remaining N columns. I'm thinking there must be something like

df = data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))
fit = lm(y~df[,2:3],data=df)

This doesn't work. I've also tried and failed with

fit = lm(y~sapply(colnames(df)[2:3],as.name),data=df)

Any thoughts?

josh asked Aug 16 '12 16:08



3 Answers

Using the formula notation y ~ . specifies that you want to regress y on all of the other variables in the dataset.

df = data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))
# fits a model using x1 and x2
fit <- lm(y ~ ., data = df) 
# Removes the column containing x1 so regression on x2 only
fit <- lm(y ~ ., data = df[, -2]) 
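
If the predictors are held as a character vector of column names, as the question describes, one option is to build the formula from those names with reformulate(). This is only a sketch: the set.seed() call and the predictors variable are illustrative additions, not part of the original answer.

```r
set.seed(1)  # only so the random columns are reproducible
df <- data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))

# Predictor names as a character vector (here: every column except y)
predictors <- setdiff(names(df), "y")

# reformulate() turns the names into the formula y ~ x1 + x2
f <- reformulate(predictors, response = "y")
fit <- lm(f, data = df)

names(coef(fit))  # "(Intercept)" "x1" "x2"
```

Unlike y ~ ., this spells the chosen columns out in the formula itself, so the data frame does not need to be subsetted first.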
Dason answered Oct 24 '22 17:10


There is an alternative to Dason's answer for when you want to specify the columns to exclude by name: use subset() and its select argument.

df = data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))
fit = lm(y ~ ., data = subset(df, select=-x1))

Trying to use df[, -c("x1")] fails with "invalid argument to unary operator", because the minus sign cannot be applied to a character vector.

It can extend to excluding multiple columns: subset(df, select = -c(x1,x2))
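
When the names to exclude are stored as strings (for example, built up programmatically), plain indexing with setdiff() is another possibility. A minimal sketch; drop_cols and keep are hypothetical names introduced for illustration, and set.seed() is there only for reproducibility.

```r
set.seed(1)
df <- data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))

drop_cols <- c("x1")                    # names of the columns to exclude
keep <- setdiff(names(df), drop_cols)   # every other column, y included

# drop = FALSE keeps the result a data frame even if one column remains
fit <- lm(y ~ ., data = df[, keep, drop = FALSE])

names(coef(fit))  # "(Intercept)" "x2"
```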

And you can still use numeric column indices:

df = data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))
fit = lm(y ~ ., data = subset(df, select = -2))

(That is equivalent to subset(df, select=-x1) because x1 is the 2nd column.)

Naturally you can also use this to specify the columns to include.

df = data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))
fit = lm(y ~ ., data = subset(df, select=c(y,x2)) )

(Yes, that is equivalent to lm(y ~ x2, df) but is distinct if you were then going to be using step(), for instance.)

Darren Cook answered Oct 24 '22 17:10


I am fairly new to R, but I found another way to do this for named columns in a data frame. Say you want to run a regression using all columns except x2; then you'd write:

df = data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))
# Removes the term x2 from the formula, so the regression is on x1 only
model <- lm(y ~ . - x2, data = df)
# To remove more terms (assuming the data frame had further columns x3 and x4)
model <- lm(y ~ . - x2 - x3 - x4, data = df)

The rest of the answers are pretty old, so maybe it's a new feature, but it's pretty neat!
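
As a quick check, the sketch below compares the "minus" formula against writing the remaining predictor out by hand, and also shows update() removing a term from an already fitted model. The object names and the set.seed() call are illustrative assumptions.

```r
set.seed(1)
df <- data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))

fit_minus <- lm(y ~ . - x2, data = df)  # all columns except x2
fit_x1    <- lm(y ~ x1, data = df)      # the same model written explicitly

all.equal(coef(fit_minus), coef(fit_x1))  # TRUE

# update() can drop a term from an existing fit in the same way
fit_full <- lm(y ~ ., data = df)
fit_upd  <- update(fit_full, . ~ . - x2)
names(coef(fit_upd))  # "(Intercept)" "x1"
```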

kumarharsh answered Oct 24 '22 17:10