I have to calculate a coefficient based on a dataset stored in a data frame (A) of 4936 obs. x 1025 variables.
The first row [1,] holds the time in seconds; each subsequent row is a sample collected from a different place. A sample of data frame A:
# V1 V2 V3 V4
# [1,] 26.4 26.5 26.6 26.7
# [2,] -15 -5 2 3
# [3,] 6 -7 5 8
# [4,] 9 4 4 -2
In another data frame (B) I stored the time from which I should start my calculations for each row in A. An example of the data frame B:
# time
# [1,] 26.4
# [2,] 26.6
# [3,] 26.5
To simplify, assume the coefficient is the sum of the data collected in one place (a row of data frame A), restricted to times at or after the starting time for that place (data frame B). For the example above, the calculation should work like this:
sum1 = -15 + (-5) + 2 + 3
sum2 = 5 + 8
sum3 = 4 + 4 + (-2)
I would like to store the results of the calculations in a new data frame, which would look like this:
# Sum
# [1,] -15
# [2,] 13
# [3,] 6
How can I link the calculations across the two data frames, depending on the value stored in the second data frame?
A solution using sapply to iterate over the rows and select columns depending on the time of collection:
# Time row from the original table (df1 holds the data of data frame A)
foo <- df1[1, ]
# Starting times from table B
time <- c(26.4, 26.6, 26.5)
# Remove time row from original table
df1 <- df1[-1, ]
# Iterate over the rows and sum the columns where foo >= time
sapply(seq_along(time), function(x)
  sum(df1[x, which(foo >= time[x])])
)
# [1] -15 13 6
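The same iteration can be written with vapply, which guarantees a numeric result of the declared shape (a sketch, assuming the same foo, df1, and time objects as above):

```r
# vapply fixes the return type: one numeric value per starting time
vapply(seq_along(time), function(x)
  sum(df1[x, which(foo >= time[x])]),
  numeric(1)
)
# [1] -15 13 6
```

This behaves like the sapply version but fails loudly if a row sum is not a single numeric value.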
I came across this already-answered question and felt urged to propose an alternative solution.
None of the other answers questioned the unusual layout of the data, although it made the proposed solutions more complex.
As a wild guess, the data seem to have been collected in an Excel sheet. However, for efficient processing we need the data stored column-wise, preferably in long format:
library(data.table)
# Transpose A so each sample becomes a column, rename the time column,
# and reshape to long format with one row per (time, site) observation
long <- melt(
  setnames(as.data.table(t(A)), "V1", "time"),
  id.vars = "time", variable.name = "site_id"
)[, site_id := as.integer(site_id)][]
long
    time site_id value
 1: 26.4       1   -15
 2: 26.5       1    -5
 3: 26.6       1     2
 4: 26.7       1     3
 5: 26.4       2     6
 6: 26.5       2    -7
 7: 26.6       2     5
 8: 26.7       2     8
 9: 26.4       3     9
10: 26.5       3     4
11: 26.6       3     4
12: 26.7       3    -2
Now, the OP has requested to aggregate the observations for each site, but only observations at or after a specific time are to be included. A data frame B with the starting time for each site is supplied.
The observations in long can be combined with the starting times in B as follows:
B <- data.table(
site_id = 1:3,
time = c(26.4, 26.6, 26.5))
B
   site_id time
1:       1 26.4
2:       2 26.6
3:       3 26.5
# aggregating in a non-equi join grouped by the join conditions
long[B, on = .(site_id, time >= time), by = .EACHI, sum(value)]
   site_id time  V1
1:       1 26.4 -15
2:       2 26.6  13
3:       3 26.5   6
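As a cross-check, the same aggregation can be reproduced with base R on the long table (a sketch, assuming long and B as defined above):

```r
# For each site, sum the values observed at or after its starting time
sapply(seq_len(nrow(B)), function(i) {
  sel <- long$site_id == B$site_id[i] & long$time >= B$time[i]
  sum(long$value[sel])
})
# [1] -15 13 6
```

This loops over the three sites explicitly, which is slower than the non-equi join for large data but makes the join condition easy to verify.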
The OP has asked in a comment and in another question how to limit the number of observations to be aggregated after the starting time. This can be achieved with a slight modification:
max_values <- 2L
long[B, on = .(site_id, time >= time), by = .EACHI, sum(value[1:max_values])]
   site_id time  V1
1:       1 26.4 -20
2:       2 26.6  13
3:       3 26.5   8
Note that max_values is set to 2L here for illustration.
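One caveat (not part of the original answer): value[1:max_values] pads with NA when a group contains fewer than max_values observations after its starting time, which turns the group sum into NA. head() truncates safely instead:

```r
# head() returns at most max_values elements without padding with NA
long[B, on = .(site_id, time >= time), by = .EACHI, sum(head(value, max_values))]
```

For the example data every group has at least two qualifying observations, so the result is identical to the one above.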