I have a long data frame that contains meteorological data from a mast. It contains observations (data$value) taken at the same time for different parameters (wind speed, direction, air temperature, etc., in data$param) at different heights (data$z).
I am trying to efficiently slice this data by data$time, and then apply functions to all of the data collected. Usually functions are applied to a single data$param at a time (i.e. I apply different functions to wind speed than I do to air temperature).
My current method is to use data.frame and ddply.
If I want to get all of the wind speed data, I run this:
# find good data ----
df <- data[(data$param == "wind speed") & !is.na(data$value), ]
I then run my function on df using ddply():
df.tav <- ddply(df,
                .(time),
                function(x) {
                  y <- data.frame(V1 = sum(x$value) + sum(x$z),
                                  V2 = sum(x$value) / sum(x$z))
                  return(y)
                })
Usually V1 and V2 are calls to other functions. These are just examples. I do need to run multiple functions on the same data though.
My current approach is very slow. I have not benchmarked it, but it's slow enough that I can go get a coffee and come back before a year's worth of data has been processed.
I have on the order of a hundred towers to process, each with a year of data and 10-12 heights, so I am looking for something faster.
data <- structure(list(time = structure(c(1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200,
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0,
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160,
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60,
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature",
"humidity", "barometric pressure", "wind direction", "turbulence",
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction",
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed",
"wind direction", "turbulence", "wind speed", "wind direction",
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed",
"wind direction", "turbulence", "wind speed", "wind direction",
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed",
"temperature", "barometric pressure", "humidity", "wind direction",
"wind speed", "turbulence", "wind direction"), value = c(-2.5,
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32,
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35,
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9,
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time",
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")
Use data.table
:
library(data.table)
dt = data.table(data)
setkey(dt, param) # sort by param to look it up fast
dt[J('wind speed')][!is.na(value),
   list(sum(value) + sum(z), sum(value) / sum(z)),
   by = time]
# time V1 V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00 102.18 0.02180000
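If you don't want to key the table, a plain vector scan in i works too. A minimal sketch (the .() alias for list() assumes a reasonably recent data.table; the small data frame below is a made-up stand-in for the question's data, not the original):

```r
library(data.table)

# tiny stand-in for the question's data: two time stamps, two heights
data <- data.frame(
  time  = as.POSIXct("2010-01-01 00:10", tz = "UTC") + rep(c(0, 600), each = 2),
  z     = c(40, 60, 40, 60),
  param = "wind speed",
  value = c(4.75, 4.78, 2.18, NA)
)

dt <- data.table(data)

# filter in i, aggregate in j, slice by time in by
res <- dt[param == "wind speed" & !is.na(value),
          .(V1 = sum(value) + sum(z), V2 = sum(value) / sum(z)),
          by = time]
```

For a one-off query this avoids the setkey() sort; the keyed join above pays off when you look up many params repeatedly.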
If you want to apply a different function for each param, here's a more uniform approach for that.
# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]
# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
#                p     fn
#1: wind direction <call>   # the fn column contains quoted calls
#2:     wind speed <call>   # i.e. this is getting fancy!
# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
# param time V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3: wind speed 2009-12-31 18:10:00 4.209735e-02
#4: wind speed 2009-12-31 18:20:00 2.180000e-02
P.S. I think the fact that I have to use param in some way before eval for eval to work is a bug.
UPDATE: As of version 1.8.11 this bug has been fixed and the following works:
dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
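A simpler alternative to storing quoted calls (my sketch, not part of the answer above): keep ordinary functions in a named list and look one up per group. Inside j the grouping value param has length 1, so fns[[param]] selects the matching function; the data below is a small made-up stand-in:

```r
library(data.table)

# stand-in data: two params at two time stamps
dt <- data.table(
  time  = rep(c(1, 2), each = 2),
  z     = c(40, 60, 40, 60),
  param = rep(c("wind direction", "wind speed"), 2),
  value = c(250.9, 4.75, 229.5, 2.18)
)

# hypothetical dispatch table: one function per parameter
fns <- list(
  "wind direction" = function(v, z) sum(v) + sum(z),
  "wind speed"     = function(v, z) sum(v) / sum(z)
)

# param is length 1 within each group, so it indexes the list directly
res <- dt[!is.na(value) & param %in% names(fns),
          .(V1 = fns[[param]](value, z)),
          by = .(param, time)]
```

This trades the eval() machinery for plain closures, at the cost of one list lookup per group.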
Use dplyr. It's still in development, but it's much much faster than plyr:
# devtools::install_github("dplyr")
library(dplyr)
windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)
summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))
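The same three steps also chain into a single pipeline. A sketch assuming a dplyr version with the %>% pipe, using a small made-up stand-in for the question's data frame:

```r
library(dplyr)

# tiny stand-in for the question's data: two time stamps, two heights
data <- data.frame(
  time  = rep(c(1, 2), each = 2),
  z     = c(40, 60, 40, 60),
  param = "wind speed",
  value = c(4.75, 4.78, 2.18, NA)
)

# filter -> group_by -> summarise, reading top to bottom
daily <- data %>%
  filter(param == "wind speed", !is.na(value)) %>%
  group_by(time) %>%
  summarise(V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))
```

filter() replaces the subset() call and also drops the NAs, so the whole analysis reads as one sequence of verbs.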
The other advantage of dplyr is that you can use a data table as a backend, without having to know anything about data.table's special syntax:
library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))
(dplyr with a data frame is 20-100x faster than plyr, and dplyr with a data.table is about another 10x faster). dplyr is nowhere near as concise as data.table, but it has a function for each major task of data analysis, which I find makes the code easier to understand - you should almost be able to read a sequence of dplyr operations to someone else and have them understand what's going on.
If you want to do different summaries per variable, I recommend changing your data structure to be "tidy":
library(reshape2)
data_tidy <- dcast(data, ... ~ param)
daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy,
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(turbulence, na.rm = TRUE)
)