K-fold cross-validation using cv.lm()

Tags:

r

I am new to R and am trying to do K-fold cross-validation using cv.lm(). Reference: http://www.statmethods.net/stats/regression.html

I am getting an error indicating that the lengths of my variables differ. However, when I checked with length(), I found that the lengths are in fact the same.

Below is a minimal dataset to replicate the problem:

X   Y
277 5.20
285 5.17
297 4.96
308 5.26
308 5.11
263 5.27
278 5.20
283 5.16
268 5.17
250 5.20
275 5.18
274 5.09
312 5.03
294 5.21
279 5.29
300 5.14
293 5.09
298 5.16
290 4.99
273 5.23
289 5.32
279 5.21
326 5.14
293 5.22
256 5.15
291 5.09
283 5.09
284 5.07
298 5.27
269 5.19

I used the code below to do the cross-validation:

# K-fold cross-validation, with K=10
sampledata <- read.table("H:/sample.txt", header=TRUE)
y.1 <- sampledata$Y
x.1 <- sampledata$X
fit=lm(y.1 ~ x.1)
library(DAAG)
cv.lm(df=sampledata, fit, m=10)

The error in the console:

Error in model.frame.default(formula = form, data = df[rows.in, ], drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'x.1')

Verification:

> length(x.1)
[1] 30
> length(y.1)
[1] 30

The above confirms that the lengths are the same.

> str(x.1)
 int [1:30] 277 285 297 308 308 263 278 283 268 250 ...
> str(y.1)
 num [1:30] 5.2 5.17 4.96 5.26 5.11 5.27 5.2 5.16 5.17 5.2 ...

> is(y.1)
[1] "numeric" "vector" 
> is(x.1)
[1] "integer"             "numeric"             "vector"              "data.frameRowLabels"

A further check, as shown above, indicates that one variable is integer and the other is numeric. But even when I convert the numeric variable to integer or the integer variable to numeric, the same error about variable lengths appears.

Can you guide me on what I should do to correct the error?

I have been unable to resolve this for the past two days and did not find any good leads by searching the internet.

Additional related query:

I see that the fit works if we use the column headers of the data set in the formula:

fit=lm(Y ~ X, data=sampledata)

a) What is the difference between the above syntax and the following?

fit1=lm(sampledata$Y ~ sampledata$X)

I thought they were the same. In the code below:

#fit 1 works
fit1=lm(Y ~ X, data=sampledata)
cv.lm(df=sampledata, fit1, m=10)

#fit 2 does not work
fit2=lm(sampledata$Y ~ sampledata$X)
cv.lm(df=sampledata, fit2, m=10)

The problem is at df=sampledata, as the column "sampledata$Y" does not exist; only Y exists. I tried changing the cv.lm call to the one below, but it does not work either:

cv.lm(fit2, m=10)

b) If we want to transform the variables, how do we use them in cv.lm()? For example:

y.1 <- (sampledata$Y/sampledata$X)
x.1 <- (1/sampledata$X)

#fit 4 problem
fit4=lm(y.1 ~ x.1)
cv.lm(df=sampledata, fit4, m=10)

Is there a way I could reference y.1 and x.1 instead of the header Y ~ X in the function?

Thanks.

Saravanan K asked Nov 02 '22 08:11


1 Answer

I'm not sure exactly why this happens, but I noticed that you do not specify the data argument for lm(), so this was my first guess:

fit=lm(Y ~ X, data=sampledata)

Since the error is gone, this may be a sufficient answer.
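A quick way to compare the two fits is to check whether the variables named in each formula are columns of the data frame that is passed as df. The sketch below uses only base R; the five-row data frame is a shortened copy of the question's data, and vars_in_data is a hypothetical helper name, not part of DAAG.

```r
# First five rows of the question's data, inlined so the example is self-contained
sampledata <- read.table(header = TRUE, text = "
X   Y
277 5.20
285 5.17
297 4.96
308 5.26
308 5.11
")

# Free-standing copies, as in the question
y.1 <- sampledata$Y
x.1 <- sampledata$X

fit_bad  <- lm(y.1 ~ x.1)                 # variables live in the workspace
fit_good <- lm(Y ~ X, data = sampledata)  # variables are columns of sampledata

# Hypothetical helper: are all formula variables columns of the data frame?
vars_in_data <- function(fit, df) {
  all(all.vars(formula(fit)) %in% names(df))
}

vars_in_data(fit_bad, sampledata)   # FALSE
vars_in_data(fit_good, sampledata)  # TRUE
```

Both calls to lm() estimate the same coefficients on the full data; they differ only in where the formula's variables are looked up when the model is refitted on part of the data.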

UPD: The reason for the error is that y.1 and x.1 do not exist in sampledata, which is passed as the df argument to cv.lm. As the error message shows, cv.lm refits the formula on subsets of df (data = df[rows.in, ]), so the formula y.1 ~ x.1 makes no sense in the cv.lm environment.
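Following the same reasoning, transformed variables such as those in part (b) of the question can be stored as columns of the data frame, so that a refit on a subset of rows can still find them. The sketch below uses only base R. The manual fold loop is a simplified stand-in for what a cross-validation routine does, not DAAG's implementation, and the column names y1 and x1, the fold count, and the shortened data are illustrative assumptions.

```r
# First ten rows of the question's data, inlined so the example is self-contained
sampledata <- read.table(header = TRUE, text = "
X   Y
277 5.20
285 5.17
297 4.96
308 5.26
308 5.11
263 5.27
278 5.20
283 5.16
268 5.17
250 5.20
")

# Store the transformed variables as columns (names y1 and x1 are illustrative)
sampledata$y1 <- sampledata$Y / sampledata$X
sampledata$x1 <- 1 / sampledata$X

form <- y1 ~ x1

# Simplified K-fold loop (assumption: K = 5 with random fold assignment)
set.seed(1)
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(sampledata)))
cvpred <- numeric(nrow(sampledata))

for (i in seq_len(k)) {
  train <- sampledata[fold != i, ]
  test  <- sampledata[fold == i, ]
  fit_i <- lm(form, data = train)                      # refit on training rows only
  cvpred[fold == i] <- predict(fit_i, newdata = test)  # predict the held-out rows
}

# Cross-validated mean squared error on the transformed response
cv_mse <- mean((sampledata$y1 - cvpred)^2)
cv_mse
```

With the columns in place, a fit such as lm(y1 ~ x1, data = sampledata) follows the same pattern as fit1 in the question, which the question reports as working with cv.lm(df=sampledata, fit1, m=10).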

tonytonov answered Nov 15 '22 07:11