I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross-validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures; it's purely for comparison purposes.) :)
The issue is that if I define the function parameters (linmod, k, direction) and run the contents of the function line by line, it works flawlessly. BUT if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe the problem is with the step function in the loop when it runs inside a function. (Try commenting out the step command and setting the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  form <- formula(linmod$call)
  # generate indices for cross validation
  rar <- n/k
  xval.idx <- list()
  s <- sample(1:n, n) # permutation of 1:n
  for (i in 1:k) {
    xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
  }
  #error calculation
  errors <- R2 <- 0
  for (j in 1:k){
    datas.test <- datas[xval.idx[[j]],]
    datas.train <- datas[-xval.idx[[j]],]
    test.idx <- xval.idx[[j]]
    #THE MODELS
    lm.1 <- lm(form,data= datas.train)
    lm.step <- step(lm.1,direction=direction,trace=0)
    step.pred <- predict(lm.step,newdata= datas.test)
    step.error <- sum((step.pred-response[test.idx])^2)
    errors[j] <- step.error/length(response[test.idx])
    SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
    R2[j] <- 1 - step.error/SS.tot
  }
  CVerror <- sum(errors)/k
  CV.R2 <- sum(R2)/k
  res <- list()
  res$CV.error <- CVerror
  res$CV.R2 <- CV.R2
  return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your model, lm.cars, its formula was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included, not the environment of cv.step. Since datas.train exists only inside cv.step, step cannot find it from there.
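To see this behavior in isolation, here is a small sketch showing that a formula remembers the environment in which it was created. The names make_formula, hidden and f are made up for illustration and are not part of the question's code:

```r
# Hypothetical helper: builds a formula inside a function
make_formula <- function() {
  hidden <- 42   # exists only in this function's frame
  y ~ x          # the formula captures this frame as its environment
}

f <- make_formula()

# The variable is reachable through the formula's environment ...
found_in_formula_env <- exists("hidden", envir = environment(f), inherits = FALSE)

# ... but it is not defined in the environment that called make_formula()
found_in_caller <- exists("hidden", envir = environment(), inherits = FALSE)

print(c(formula_env = found_in_formula_env, caller = found_in_caller))
```

The same mechanism works in reverse in the question: the formula points at one environment, while datas.train lives in another.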
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  .env <- environment() ## identify the environment of cv.step
  ## extract the formula in the environment of cv.step
  form <- as.formula(linmod$call, env = .env)
  ## The rest of your function follows
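For reference, here is a stripped-down sketch of the same idea that can be run on its own. The function name fit_step_subset and its rows argument are hypothetical, and it sets the environment with environment(form) <- environment(), which is assumed here to be an equivalent way of pointing the formula at the function's frame:

```r
lm.cars <- lm(mpg ~ ., data = mtcars)

# Hypothetical helper: refit on a subset of rows, then run step()
fit_step_subset <- function(linmod, rows, direction = "both") {
  datas.train <- linmod$model[rows, ]
  form <- formula(linmod)
  # Point the formula at this function's frame so that step()
  # can find datas.train when it rebuilds candidate models
  environment(form) <- environment()
  lm.1 <- lm(form, data = datas.train)
  step(lm.1, direction = direction, trace = 0)
}

lm.step <- fit_step_subset(lm.cars, rows = 1:25)
print(formula(lm.step))
```

In the full cv.step function, the same assignment would go right after form is created, before the cross-validation loop.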
Another problem that can cause this is passing a character (string) vector to lm instead of a formula. Character vectors have no environment, so when lm converts the character to a formula, the result apparently ends up without the local environment instead of being automatically assigned it. If one then uses an object as weights that is not in the data.frame given as the data argument, but is a local function argument, one gets a "not found" error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
  #the formula to use
  f = "x ~ y"
  #residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
  #the formula to use
  f = as.formula("x ~ y")
  #residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
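One way to see why the second version works: as.formula, when called inside a function, attaches that function's frame to the formula it builds, so wtds becomes visible to lm. The sketch below checks this and also shows a variant, under the hypothetical name residualizer3, that builds the formula from the x and y arguments with reformulate() instead of a hard-coded string:

```r
# as.formula() called inside a function attaches that function's frame
env_check <- function() {
  f <- as.formula("x ~ y")
  identical(environment(f), environment())
}
print(env_check())

# Hypothetical variant: build the formula from the variable names
residualizer3 <- function(data, x, y, wtds) {
  # reformulate() also attaches the calling frame by default
  f <- reformulate(termlabels = y, response = x)
  resid(lm(formula = f, data = data, weights = wtds))
}

set.seed(1)
d_example <- data.frame(x = rnorm(10), y = rnorm(10))
weightsvar <- runif(10)
res3 <- residualizer3(d_example, x = "x", y = "y", wtds = weightsvar)
print(length(res3))
```

Building the formula inside the function, rather than passing a bare string to lm, keeps the function's local arguments within reach of the model-fitting code.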
The bug becomes even harder to debug if one uses the name weights for the weights variable. In this case, since lm can't find the weights object, it falls back to the function weights() from the stats package, thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.