Linear models in R with different combinations of variables

Tags: variables, r, automation, lm

I am new to R and I am stuck with a problem. I am trying to read a data set from a table and fit linear models to it. Below is how I read my data, along with my variable names:

> data = read.table(datafilename, header=TRUE)
> names(data)
[1] "price" "model" "size"  "year"  "color"

What I want to do is create several linear models using different combinations of the variables (price being the target), such as:

> attach(data)
> model1 = lm(price~model+size)
> model2 = lm(price~model+year)
> model3 = lm(price~model+color)
> model4 = lm(price~model+size)
> model5 = lm(price~size+year+color)
# ... and so on for all the different combinations ...

My main aim is to compare the different models. Is there a cleverer way to generate these models than hard-coding the variables, especially since the number of variables will in some cases increase to 13 or so?


asked Apr 09 '14 07:04

Reyhaneh

3 Answers

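If your goal is model selection, there are several tools available in R which attempt to automate this process. One of them is dredge(...) in the MuMIn package, which fits the sub-models of a global model and ranks them in a model selection table; read its documentation.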

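# dredge: example of use
library(MuMIn)
options(na.action = "na.fail")                       # dredge() refuses to run under the default na.omit
df <- mtcars[, c("mpg","cyl","disp","hp","wt")]      # subset of mtcars
full.model <- lm(mpg ~ cyl + disp + hp + wt, data = df)  # global model for predicting mpg
dredge(full.model)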
# Global model call: lm(formula = mpg ~ cyl + disp + hp + wt, data = df)
# ---
# Model selection table 
#    (Intrc)     cyl      disp       hp     wt df   logLik  AICc delta weight
# 10   39.69 -1.5080                    -3.191  4  -74.005 157.5  0.00  0.291
# 14   38.75 -0.9416           -0.01804 -3.167  5  -72.738 157.8  0.29  0.251
# 13   37.23                   -0.03177 -3.878  4  -74.326 158.1  0.64  0.211
# 16   40.83 -1.2930  0.011600 -0.02054 -3.854  6  -72.169 159.7  2.21  0.096
# 12   41.11 -1.7850  0.007473          -3.636  5  -73.779 159.9  2.37  0.089
# 15   37.11         -0.000937 -0.03116 -3.801  5  -74.321 161.0  3.46  0.052
# 11   34.96         -0.017720          -3.351  4  -78.084 165.6  8.16  0.005
# 9    37.29                            -5.344  3  -80.015 166.9  9.40  0.003
# 4    34.66 -1.5870 -0.020580                  4  -79.573 168.6 11.14  0.001
# 7    30.74         -0.030350 -0.02484         4  -80.309 170.1 12.61  0.001
# 2    37.88 -2.8760                            3  -81.653 170.2 12.67  0.001
# 8    34.18 -1.2270 -0.018840 -0.01468         5  -79.009 170.3 12.83  0.000
# 6    36.91 -2.2650           -0.01912         4  -80.781 171.0 13.55  0.000
# 3    29.60         -0.041220                  3  -82.105 171.1 13.57  0.000
# 5    30.10                   -0.06823         3  -87.619 182.1 24.60  0.000
# 1    20.09                                    2 -102.378 209.2 51.68  0.000

Treat these tools as aids to making intelligent decisions. Do not let the tool make the decision for you!

For example, in this case dredge(...) suggests that the "best" model for predicting mpg, based on the AICc criterion, includes cyl and wt. But note that AICc for this model is 157.5, whereas the second-best model has an AICc of 157.8, so these are basically the same. In fact, the first 5 models in this list are all within about 2.4 AICc units of one another, so they do not differ meaningfully in their ability to predict mpg. It does, however, narrow things down a bit. Among these 5, I would want to look at the distribution of residuals (should be normal), trends in residuals (there should be none), and leverage (do some points have undue influence?) before picking a "best" model.
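As a rough sketch of those checks, assuming the mtcars data from the example above and taking the cyl + wt model as the candidate (the names fit, lev and cook are illustrative only), base R covers all three:

```r
# Sketch: residual and influence checks for one candidate model
fit <- lm(mpg ~ cyl + wt, data = mtcars)

# Distribution of residuals: Shapiro-Wilk test and a normal Q-Q plot
shapiro.test(residuals(fit))
plot(fit, which = 2)

# Trends in residuals: residuals vs fitted values should show no pattern
plot(fit, which = 1)

# Leverage and influence: hat values and Cook's distance per observation
lev  <- hatvalues(fit)
cook <- cooks.distance(fit)
head(sort(cook, decreasing = TRUE), 3)   # the three most influential cars
plot(fit, which = 5)                     # residuals vs leverage
```

The same calls can be repeated for each of the other shortlisted models before settling on one.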


answered Nov 08 '22 01:11

jlhoward

Here's one way to get all of the combinations of variables using the combn function. It's a bit messy, and uses a loop (perhaps someone can improve on this with mapply):

vars <- c("price","model","size","year","color")
N <- 1:4                                   # subset sizes to try
# lapply always returns a list (sapply may simplify it to a matrix)
COMB <- lapply(N, function(m) combn(x=vars[2:5], m))
COMB2 <- list()
k <- 0
for(i in seq_along(COMB)){
    tmp <- COMB[[i]]                       # one column per combination
    for(j in seq_len(ncol(tmp))){
        k <- k + 1
        COMB2[[k]] <- formula(paste("price", "~", paste(tmp[,j], collapse=" + ")))
    }
}

Then, you can call these formulas and store the model objects using a list or possibly give unique names with the assign function:

res <- vector(mode="list", length(COMB2))
for(i in seq_along(COMB2)){
    res[[i]] <- lm(COMB2[[i]], data=data)  # one fitted model per formula
}
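A loop-free variant is sketched below. It is only one possible arrangement, not a definitive implementation: it uses the built-in mtcars data as a stand-in for the price data, and the names response, predictors, subsets, formulas, fits and aic are illustrative. It swaps the paste/formula step for base R's reformulate and uses lapply rather than mapply.

```r
# Sketch on the built-in mtcars data (stand-in for the price data)
response   <- "mpg"
predictors <- c("cyl", "disp", "hp", "wt")

# All non-empty subsets of the predictors, as a flat list of character vectors
subsets <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# One formula and one fitted model per subset
formulas <- lapply(subsets, reformulate, response = response)
fits     <- lapply(formulas, lm, data = mtcars)
names(fits) <- vapply(subsets, paste, character(1), collapse = " + ")

# Rank the candidates by AIC (lower is better)
aic <- sort(vapply(fits, AIC, numeric(1)))
head(aic)
```

With p predictors this fits 2^p - 1 models, so 13 predictors means 8191 fits; that is still feasible for lm, but the count doubles with every added variable.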

answered Nov 08 '22 00:11

Marc in the box

You can use stepwise multiple regression to determine what variables make sense to include. To get this started you write one lm() statement with all variables, such as:

library(MASS)
fit <- lm(price ~ model + size + year + color, data = data)

Then you continue with:

step <- stepAIC(fit, direction="both")

Finally, you can use the following to show the results:

step$anova
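For a version that runs without the price data, here is a minimal sketch on the built-in mtcars data. The names full and sel are illustrative, and trace = FALSE is added only to suppress the step-by-step log:

```r
library(MASS)

# Full model on the built-in mtcars data (a stand-in for the price data)
full <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

# Stepwise search in both directions, starting from the full model
sel <- stepAIC(full, direction = "both", trace = FALSE)

sel$anova       # the sequence of terms dropped or added
formula(sel)    # the formula of the selected model
```

The returned object is itself a fitted lm, so summary(sel) and the usual diagnostic plots apply to it directly.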

Hope this gives you some inspiration for advancing your script.

answered Nov 08 '22 01:11

Jochem
