So, I asked this question earlier but for array formulas in Excel. However, my dataset is so large that the spreadsheet becomes near on unresponsive so I want to be able to run it in R for efficiency purposes.
Link to previously answered question
I'll reiterate and expand on it here though.
Day Type Val
1 A 5
1 B 6
1 C 9
1 D 7
2 B 2
2 A 8
2 C 3
2 D 3
3 C 4
3 B 2
3 A 2
4 A 5
4 B 9
4 C 8
5 A 7
5 B 5
6 A 6
6 B 3
6 C 4
7 A 7
7 B 9
I've provided a simplified data frame above (mine has 10's of thousands of rows and hundreds of 'Type'). I want to be able to assign a 'Type' and then
calculate the average 'Val' for the remaining 'Type' where the assigned 'Type' doesn't exist on that day
calculate the average 'Val' for the remaining 'Type' where the assigned 'Type' does exist on that day
For example, if I assign 'Type' C, then C does not exist on days 5 and 7.
The output would then be:
Type Average_Without Average_With
A 7.0 5.2
B 7.0 4.4
C NA 5.2
D NA 5.0
Here is a way to get the values you want, using data.table
package (but @DavidArenburg solution is better and will result in your desired output):
library(data.table)
# Average_with:
setDT(df)[Day %in% Day[Type=="C"], mean(Val), by=Type]
# Type V1
#1: A 5.2
#2: B 4.4
#3: C 5.6
#4: D 5.0
# Average_without
df[!Day %in% Day[Type=="C"], mean(Val), by=Type]
# Type V1
#1: A 7
#2: B 7
Edit, to get your output:
setDT(df)[, C:=(Day %in% Day[Type=="C"])]
res <- df[, mean(Val), by=.(Type, C)]
dcast(res, Type~C, value.var="V1")
# Type FALSE TRUE
#1: A 7 5.2
#2: B 7 4.4
#3: C NA 5.6
#4: D NA 5.0
Or more direct, thanks to @Frank:
setDT(df)[, C := Day %in% df[Type=="C", unique(Day)]]
dcast(df, Type~C, value.var="Val", fun=mean)
Some benchmarks
set.seed(123)
DF <- df <- data.frame(Day=sample(1:1000, 50000, replace=TRUE),
Type=paste0(sample(letters[1:15], 50000, replace=TRUE), sample(letters[1:15], 50000, replace=TRUE)),
Val=rnorm(50000), stringsAsFactors=FALSE)
library(microbenchmark)
# the test is made on "mo" (205 days with, 795 days without)
microbenchmark(cath1(df), cath2(df), cath_Frank(df), david1(df), david2(df), GG1(DF), GG2(DF), GG3(DF), GG4(DF), unit="relative")
# expr min lq mean median uq max neval cld
# cath1(df) 1.3533329 1.4811559 1.4137217 1.502278 1.4146092 1.0523168 100 a
# cath2(df) 1.0000000 1.0000000 1.0000000 1.000000 1.0000000 1.0000000 100 a
# cath_Frank(df) 1.2985873 1.2980500 1.2380200 1.312180 1.2882213 0.9784906 100 a
# david1(df) 0.8642457 0.8717645 0.9768656 1.012679 0.9367868 0.9864712 100 a
# david2(df) 1.1708477 1.1723941 1.2105785 1.307281 1.2139049 0.9624526 100 a
# GG1(DF) 13.8436430 13.7552020 13.0925864 13.727017 15.9302047 3.0718886 100 d
# GG2(DF) 4.8765172 4.8827384 4.9342907 4.943654 4.5173281 1.6818194 100 b
# GG3(DF) 8.8005347 8.8393882 9.6084771 9.048975 11.9310902 4.1580238 100 c
# GG4(DF) 4.4787631 4.5812781 4.5098152 4.623952 4.2268167 1.5829500 100 b
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With