 

linear regression using lm() - surprised by the result

I ran a linear regression on my data using the lm function. Everything works (no error message), but I'm surprised by the result: I get the impression that R "misses" a group of points, i.e. the intercept and slope are not the best fit. For instance, I am referring to the group of points at around x = 15-25, y = 0-20.

My questions:

  • is there a function to compare fit with "expected" coefficients and "lm-calculated" coefficients?
  • have I made a silly mistake when coding, leading the lm to do that?

Following some answers: additional information on x and y

x and y are both visual estimates of disease symptoms, with the same uncertainty on each.

[Data plot, with linear regression and abline of expected results]

The data and code are here:

x1=c(24.0,23.9,23.6,21.6,21.0,20.8,22.4,22.6,
     21.6,21.2,19.0,19.4,21.1,21.5,21.5,20.1,20.1,
     20.1,17.2,18.6,21.5,18.2,23.2,20.4,19.2,22.4,
     18.8,17.9,19.1,17.9,19.6,18.1,17.6,17.4,17.5,
     17.5,25.2,24.4,25.6,24.3,24.6,24.3,29.4,29.4,
     29.1,28.5,27.2,27.9,31.5,31.5,31.5,27.8,31.2,
     27.4,28.8,27.9,27.6,26.9,28.0,28.0,33.0,32.0,
     34.2,34.0,32.6,30.8)

y1=c(100.0,95.5,93.5,100.0,98.5,99.5,34.8,
     45.8,47.5,17.4,42.6,63.0,6.9,12.1,30.5,
     10.5,14.3,41.1, 2.2,20.0,9.8,3.5,0.5,3.5,5.7,
     3.1,19.2,6.4, 1.2, 4.5, 5.7, 3.1,19.2, 6.4,
     1.2,4.5,81.5,70.5,91.5,75.0,59.5,73.3,66.5,
     47.0,60.5,47.5,33.0,62.5,87.0,86.0,77.0,
     86.0,83.0,78.5,83.0,83.5,73.0,69.5,82.5,78.5,
     84.0,93.5,83.5,96.5,96.0,97.5)   



## x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))

# linear regression
reg_lin=lm(y1 ~ x1)
abline(reg_lin,lty="solid", col="royalblue")
text(12.5,25,labels="R result",col="royalblue", cex=0.85)
text(12.5,20,labels=bquote(y== .(5.26)*x - .(76)),col="royalblue", cex=0.85)

# result I would have imagined
abline(a=-150,b=8,lty="dashed", col="red")
text(27.5,25,labels="What I think is better",col="red", cex=0.85)
text(27.5,20,labels=bquote(y== .(8)*x - .(150)),col="red", cex=0.85)
asked Aug 06 '15 by NOTM

1 Answer

Try this:

# intercept and slope from the lm fit
reg_lin_int <- reg_lin$coefficients[1]
reg_lin_slp <- reg_lin$coefficients[2]

# residual sum of squares: lm fit vs. the hand-drawn line
sum((y1 - (reg_lin_int + reg_lin_slp*x1)) ^ 2)
# [1] 39486.33
sum((y1 - (-150 + 8 * x1)) ^ 2)
# [1] 55583.18

The sum of squared residuals is lower under the lm fit line. This is to be expected, as reg_lin_int and reg_lin_slp are guaranteed to produce the minimal total squared error.
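
For reference, R already stores this quantity in the fitted object, so you don't have to recompute it by hand. A small sketch using base-R accessors (it gives the same value as above):

sum(residuals(reg_lin) ^ 2)  # residual sum of squares from the fit
deviance(reg_lin)            # for lm objects, deviance() returns the same RSS
# [1] 39486.33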

Intuitively, we know that estimators under a squared loss function are sensitive to outliers. The fit "misses" the group at the bottom because it is pulled toward the group at the top left, which lies much further away--and squared distance gives those points more weight.
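
You can see this numerically by looking at how much each point contributes to the total loss. This is only an illustrative sketch, using the hand-drawn line y = 8x - 150 from the question:

res <- y1 - (-150 + 8 * x1)            # residuals of the hand-drawn line
sq_share  <- res^2 / sum(res^2)        # each point's share of the squared loss
abs_share <- abs(res) / sum(abs(res))  # each point's share of the absolute loss

# the biggest residuals dominate the squared loss far more than the absolute loss
round(sort(sq_share, decreasing = TRUE)[1:5], 3)
round(sort(abs_share, decreasing = TRUE)[1:5], 3)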

In fact, if we use Least Absolute Deviations regression (i.e., we specify an absolute loss function instead of a squared one), the result is much closer to your guess:

library(quantreg)
lad_reg <- rq(y1 ~ x1)  # quantile regression at the default tau = 0.5, i.e. LAD / median regression

[plot: data with the LAD regression line overlaid]

(Pro tip: use lwd to make your graphs much more readable)
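
If you want to redraw that comparison yourself, one way to overlay the LAD fit on the scatterplot from the question is below; the line type and width are just placeholders, not necessarily what the figure above used:

abline(lad_reg, lty = "dotdash", lwd = 2)  # abline() picks up the intercept and slope via coef(lad_reg)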

What gets even closer to what you had in mind is Total Least Squares, as mentioned by @nongkrong and @MikeWilliamson. Here is the result of TLS on your sample:

v <- prcomp(cbind(x1, y1))$rotation    # principal axes of the (x1, y1) point cloud
bbeta <- v[2, 1] / v[1, 1]             # TLS slope: y over x component of the first principal axis
inter <- mean(y1) - bbeta * mean(x1)   # the TLS line passes through the centroid

[plot: data with the TLS line overlaid]
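
As before, the TLS line can be added to the existing plot with abline (the styling is again only illustrative):

abline(a = inter, b = bbeta, lty = "dotted", lwd = 2)  # overlay the TLS fit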

answered Sep 30 '22 by MichaelChirico