
Way to extract data from lm-object before function is applied?

Tags: r, lm

Let me dive directly into an example to show my problem:

 rm(list=ls())
 n <- 100
 df <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n) )
 fm <- lm(y ~ x1 + poly(x2, 2), data=df)

Now, I would like to have a look at the previously used data. This is almost available by using

 temp.data <- fm$model

However, x2 will have been split up into poly(x2, 2), which will itself be a data frame, as it contains a value for x2 and x2^2. Note that it may seem as if x2 is contained here, but since the polynomial uses orthogonal components, the first of these columns in temp.data is not the same as df$x2. This can also be seen if you compare the variables visually after, say, the following: new.dat <- cbind(df, fm$model).

Now, to some questions:

First, and most importantly, is there a way to retrieve x2 from the lm-object in its original form? Or more generally, if some function f has been applied to some variable in the lm-formula, can the underlying variables be extracted from the lm-object (without doing case-specific math)? Note that I know I could retrieve the data by other means, but I wonder if I can extract it from the lm-object itself.

Second, on a more general note, since I did explicitly not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?

Third, the command head(new.dat) shows me that x2 has been split up into two components. What I see when I type View(new.dat) is, however, only one column. This strikes me as puzzling. How can two columns be represented as one, and why is there a difference between head and View? If anyone can explain, I would be highly indebted!

If these questions are too basic, I apologize. In that case, I would appreciate any pointers to relevant manuals where this is explained.

Thanks in advance!

coffeinjunky asked Apr 07 '14 19:04



2 Answers

Good question, but this is difficult. fm$model is a weird data frame, of a type that would be hard for a user to construct, but which R sometimes generates internally. Check out the first few lines of str(fm$model), which show you that it's a data frame whose third component is an object of class poly with dimensions (100,2) -- i.e. something like a matrix:

## 'data.frame':    100 obs. of  3 variables:
##  $ y          : num  -0.5952 -1.9561 1.8467 -0.2782 -0.0278 ...
##  $ x1         : num  0.423 -1.539 -0.694 0.254 -0.13 ...
##  $ poly(x2, 2): poly [1:100, 1:2] 0.0606 -0.0872 0.0799 -0.1068 -0.0395 ...
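To see this for yourself, here is a minimal sketch that rebuilds the example from the question and inspects the matrix-valued column. The seed and the name p are my additions, not part of the original post. It may also help with the head/View puzzle: the model frame has three columns, one of which is a two-column matrix, whereas the design matrix has four.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
n  <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
fm <- lm(y ~ x1 + poly(x2, 2), data = df)

# Pull the column out by its exact name to avoid partial matching
p <- fm$model[["poly(x2, 2)"]]

class(p)                # includes "poly" and "matrix"
dim(p)                  # rows x columns of the polynomial basis
ncol(fm$model)          # y, x1, and the single matrix-valued column
ncol(model.matrix(fm))  # intercept, x1, and the two basis columns
```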

If you're still working in the environment from which lm was called in the first place, and if lm was called using the data argument, you can use eval(getCall(fm)$data) to get the original data. If things are being passed in and out of functions, or if someone used lm on independent objects in the environment, you're probably out of luck. If you get in trouble you can try

eval(getCall(fm)$data, environment(formula(fm)))

but things rapidly start getting harder.
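As a rough sketch of the approach above, assuming the data frame is still reachable from the environment in which the formula was created (the seed and the name orig are mine):

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
n  <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
fm <- lm(y ~ x1 + poly(x2, 2), data = df)

# getCall(fm)$data is just the unevaluated symbol `df`;
# eval() looks that name up again in the formula's environment
orig <- eval(getCall(fm)$data, environment(formula(fm)))

identical(orig$x2, df$x2)  # the untransformed x2
```

Note that this only looks the name up again: if df has been modified or removed since the fit, you get the modified version or an error, not the data that lm actually saw.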

I don't fully understand the logic of storing the processed model frame rather than the raw data, but I think it has to do with the construction of the terms object for the linear model -- each element in the stored model frame corresponds to an element of the terms object. I don't really understand the distinction between factors -- which are post-processed by model.matrix into sets of columns of dummy variables -- and transformed data (e.g. log(x)) or special objects like polynomial or spline bases ...

Ben Bolker answered Oct 07 '22 00:10


The question is how badly you need it. If you look at the structure of fm$model$poly, then at the end you will see something like this:

attr(,"coefs")
attr(,"coefs")$alpha
[1] 0.06738858 0.10887048

attr(,"coefs")$norm2
[1]   1.00000 100.00000  93.96666 155.01387

I suppose these coefficients could be used to restore your original data from poly. See the source code for the poly function (either page(poly) or just type poly in the console) ... it looks like computing the polynomials might be reversible. But why bother doing it? I can think of two reasons: (1) you have lost the original data and this is the only way to restore it; (2) you want to understand how R computes orthogonal polynomials.
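For what it's worth, here is a hedged sketch of such a reversal. It assumes the default orthogonal basis (raw = FALSE) and relies on my reading of the poly source: the first basis column is x minus alpha[1], divided by sqrt(norm2[3]). Treat it as an illustration to verify against data you still have, not as a documented interface; the names p, coefs and x2_rec are mine.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
n  <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
fm <- lm(y ~ x1 + poly(x2, 2), data = df)

p     <- fm$model[["poly(x2, 2)"]]
coefs <- attr(p, "coefs")  # list with elements alpha and norm2

# Undo the centring and scaling of the linear column:
#   p[, 1] = (x2 - alpha[1]) / sqrt(norm2[3])
x2_rec <- coefs$alpha[1] + p[, 1] * sqrt(coefs$norm2[3])

all.equal(unname(x2_rec), df$x2)  # equal up to floating-point error
```

This is exactly the kind of case-specific math the question hoped to avoid, so it is mainly useful when the original data are truly gone.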

Quoting the question: "Second, on a more general note, since I did explicitly not ask for model.matrix(fm), why do I get data that has been manipulated? What is the underlying philosophy behind that? Does anyone know?"

Do you mean, why is data saved with the lm object at all? Just in case, I suppose. You can easily switch it off:

fm <- lm(y ~ x1 + poly(x2, 2), data=df, model=FALSE)
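A small sketch of what that switch does, under the same example setup (the seed and the names fm2 and mf are mine). With model = FALSE the fitted object carries no model frame, but model.frame() can still rebuild one as long as the original data are reachable:

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
n  <- 100
df <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))

fm2 <- lm(y ~ x1 + poly(x2, 2), data = df, model = FALSE)
is.null(fm2$model)      # no model frame is stored in the fit

# model.frame() re-evaluates the call, so it needs df to still exist
mf <- model.frame(fm2)
names(mf)               # again y, x1 and the processed poly(x2, 2) column
```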

Or why are the data "manipulated", i.e., why is poly(x2, 2) saved with the data instead of the original x2? My understanding is that you requested this yourself. The poly(x2, 2) part is first evaluated and then passed to lm, so that lm doesn't even have the original x2.

edit - to answer the comment below in a more convenient way

Quoting the comment: "For instance, using factor(f) for some additional factor variable does not get translated into a data frame being stored in fm$model. Only the actual variable f is being stored in fm$model, whereas in this case with poly, some transformation is stored. This puzzles me."

I think you've missed something here: the behaviour is the same for both poly and factor.

> df <- data.frame(a=1:5, b=2:6, c=rnorm(5))
> fm <- lm(c~ a + factor(b), df)
> fm$model
           c a factor(b)
1  0.5397541 1         2
2  0.9108087 2         3
3  0.1819442 3         4
4 -0.9293893 4         5
5  0.1404305 5         6
> fm$model$factor
[1] 2 3 4 5 6
Levels: 2 3 4 5 6
Warning message:
In `$.data.frame`(fm$model, factor) : Name partially matched in data frame

You can see that fm$model has factor(b) instead of b, and fm$model$factor is indeed a factor, not the original integer variable. (The warning appears because the name is actually factor(b); I used factor to avoid typing something as ugly as fm$model$`factor(b)`.)

lebatsnok answered Oct 06 '22 23:10