Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R Language - Sorting data into ranges; averaging; ignore outliers

Tags:

r

outliers

I am analyzing data from a wind turbine, normally this is the sort of thing I would do in excel but the quantity of data requires something heavy-duty. I have never used R before and so I am just looking for some pointers.

The data consists of 2 columns WindSpeed and Power, so far I have arrived at importing the data from a CSV file and scatter-plotted the two against each other.

What I would like to do next is to sort the data into ranges; for example all data where WindSpeed is between x and y and then find the average of power generated for each range and graph the curve formed.

From this average I want recalculate the average based on data which falls within one of two standard deviations of the average (basically ignoring outliers).

Any pointers are appreciated.

For those who are interested I am trying to create a graph similar to this. Its a pretty standard type of graph but like I said the shear quantity of data requires something heavier than excel.

like image 995
klonq Avatar asked Jan 30 '11 13:01

klonq


2 Answers

Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.

Using a slight modification of csgillespie's sample data:

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

plot(w_sp, power)

x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
like image 80
hadley Avatar answered Sep 24 '22 06:09

hadley


Throw this version, similar in motivation as @hadley's, into the mix using an additive model with an adaptive smoother using package mgcv:

Dummy data first, as used by @hadley

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)

Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML

require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)

Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval

x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)

Plot everything and the Loess fit for comparison

plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
      lwd = 3)

adaptive smooth and loess fits

like image 40
Gavin Simpson Avatar answered Sep 24 '22 06:09

Gavin Simpson