Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

John Tukey "median median" (or "resistant line") statistical test for R and linear regression

I'm searching the John Tukey algorithm which compute a "resistant line" or "median-median line" on my linear regression with R.

A student on a mailling list explain this algorithm in these terms :

"The way it's calculated is to divide the data into three groups, find the x-median and y-median values (called the summary point) for each group, and then use those three summary points to determine the line. The outer two summary points determine the slope, and an average of all of them determines the intercept."

Article about John tukey's median median for curious : http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/

Do you have an idea of where i could find this algorithm or R function ? In which packages, Thanks a lot !

like image 936
reyman64 Avatar asked Jul 11 '10 21:07

reyman64


1 Answers

There's a description of how to calculate the median-median line here. An R implementation of that is

median_median_line <- function(x, y, data)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data) 
    y <- eval(substitute(y), data) 
  }
  
  stopifnot(length(x) == length(y))

  #Step 1
  one_third_length <- floor(length(x) / 3)
  groups <- rep(1:3, times = switch((length(x) %% 3) + 1,
     one_third_length,
     c(one_third_length, one_third_length + 1, one_third_length),
     c(one_third_length + 1, one_third_length, one_third_length + 1)
  ))

  #Step 2
  x <- sort(x)
  y <- sort(y)
  
  #Step 3
  median_x <- tapply(x, groups, median)                                 
  median_y <- tapply(y, groups, median)

  #Step 4
  slope <- (median_y[3] - median_y[1]) / (median_x[3] - median_x[1])
  intercept <- median_y[1] - slope * median_x[1]

  #Step 5
  middle_prediction <- intercept + slope * median_x[2]
  intercept <- intercept + (median_y[2] - middle_prediction) / 3
  c(intercept = unname(intercept), slope = unname(slope))
}

To test it, here's an example:

dfr <- data.frame(
  time = c(.16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89),
  distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7, 150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1))
  
median_median_line(time, distance, dfr) 
#intercept     slope 
#   -113.6     520.0

Note the slightly odd way of specifying the groups. The instructions are quite picky about how you define group sizes, so the more obvious method of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.

like image 122
Richie Cotton Avatar answered Nov 08 '22 03:11

Richie Cotton