Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like

    obs <- 1:5
    mod <- c(0.8, 2.4, 2, 3, 4.8)
    df <- data.frame(obs, mod)

    R2 <- rsq(df)  # 0.85
asked Dec 01 '16 by Esme_


2 Answers

You need a little statistical knowledge to see this. The R squared between two vectors is just the square of their correlation. So you can define your function as:

    rsq <- function(x, y) cor(x, y) ^ 2
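
Applied to the data from the question (obs and mod as given there):

    obs <- 1:5
    mod <- c(0.8, 2.4, 2, 3, 4.8)
    rsq(obs, mod)
    #[1] 0.8560185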

Sandipan's answer returns exactly the same result (see the proof below), but as it stands his version is arguably more readable (thanks to the explicit $r.squared).


Let's do the statistics

Basically, we fit a linear regression of y on x, and compute the ratio of the regression sum of squares to the total sum of squares.

lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x), i.e., centering both variables removes the intercept and leaves the slope unchanged.

lemma 2: the least-squares slope is beta = cov(x, y) / var(x).

lemma 3: R.square = cor(x, y) ^ 2. By lemmas 1 and 2, the regression sum of squares is beta ^ 2 * sum((x - mean(x)) ^ 2) = cov(x, y) ^ 2 / var(x) * (n - 1), while the total sum of squares is var(y) * (n - 1); their ratio is cov(x, y) ^ 2 / (var(x) * var(y)) = cor(x, y) ^ 2.
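
A quick numerical check of the lemmas, reusing the obs and mod vectors from the question:

    x <- 1:5
    y <- c(0.8, 2.4, 2, 3, 4.8)
    cov(x, y) / var(x)            ## lemma 2: the slope, 0.86
    unname(coef(lm(y ~ x))[2])    ## same slope from lm: 0.86
    summary(lm(y ~ x))$r.squared  ## R squared from the fitted regression: 0.8560185
    cor(x, y) ^ 2                 ## lemma 3: identical, 0.8560185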


Warning

The R squared between two arbitrary vectors x and y (of the same length) is just a measure of the goodness of their linear relationship. Think twice!! The R squared between x + a and y + b is identical for any constant shifts a and b. So it is a weak or even useless measure of "goodness of prediction". Use MSE or RMSE instead (a minimal sketch follows the links below):

  • How to obtain RMSE out of lm result?
  • R - Calculate Test MSE given a trained model from a training set and a test set
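
For completeness, a minimal MSE / RMSE sketch in base R (the function names mse and rmse are my own, not from any package):

    mse  <- function(pred, actual) mean((pred - actual) ^ 2)
    rmse <- function(pred, actual) sqrt(mse(pred, actual))
    rmse(mod, obs)  ## on the question's data: 0.6693280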

I agree with 42-'s comment:

The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.

R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.

Here is a very extreme example:

    preds <- 1:4 / 4
    actual <- 1:4

The R squared between those two vectors is 1. Of course, one is just a linear rescaling of the other, so they have a perfect linear relationship. But do you really think preds is a good prediction of actual??
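
To make that concrete (a sketch reusing the rsq function defined at the top of this answer):

    preds <- 1:4 / 4
    actual <- 1:4
    rsq(preds, actual)                 #[1] 1 (perfect linear relationship)
    sqrt(mean((preds - actual) ^ 2))   #[1] 2.05396 (RMSE: the predictions are far off)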


In reply to wordsforthewise

Thanks for your comments 1 and 2, and for your detailed answer.

You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x, then compute the regression sum of squares and the total sum of squares. It looks like you skipped this regression step and went straight to the sum-of-squares computation. That is wrong, since then the partition of the sum of squares does not hold and you can't compute the R squared in a consistent way.

As you demonstrated, this is just one way of computing R squared:

    preds <- c(1, 2, 3)
    actual <- c(2, 2, 4)
    rss <- sum((preds - actual) ^ 2)         ## residual sum of squares
    tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
    rsq <- 1 - rss / tss
    #[1] 0.25

But there is another:

    regss <- sum((preds - mean(preds)) ^ 2)  ## regression sum of squares
    regss / tss
    #[1] 0.75

Also, your formula can give a negative value (whereas the proper value is 1, as mentioned in the Warning section above):

    preds <- 1:4 / 4
    actual <- 1:4
    rss <- sum((preds - actual) ^ 2)         ## residual sum of squares
    tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
    rsq <- 1 - rss / tss
    #[1] -2.375
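
For comparison, a sketch that fits the regression first, so that both sum-of-squares definitions agree (using your preds/actual example from above):

    preds <- c(1, 2, 3)
    actual <- c(2, 2, 4)
    fit <- lm(actual ~ preds)
    rss   <- sum((actual - fitted(fit)) ^ 2)         ## residual sum of squares
    tss   <- sum((actual - mean(actual)) ^ 2)        ## total sum of squares
    regss <- sum((fitted(fit) - mean(actual)) ^ 2)   ## regression sum of squares
    1 - rss / tss           #[1] 0.75
    regss / tss             #[1] 0.75
    cor(preds, actual) ^ 2  #[1] 0.75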

Final remark

I never expected that this answer could eventually become so long when I posted my initial answer two years ago. However, given the high view count of this thread, I feel obliged to add more statistical details and discussion. I don't want to mislead people into thinking that just because they can compute an R squared so easily, they can use it everywhere.

answered Oct 05 '22 by Zheyuan Li


Why not this:

    rsq <- function(x, y) summary(lm(y ~ x))$r.squared
    rsq(obs, mod)
    #[1] 0.8560185
answered Oct 05 '22 by Sandipan Dey