Sum IF Multiple Criteria with a Difference - R

Question

So, I asked this question earlier but for array formulas in Excel. However, my dataset is so large that the spreadsheet becomes near on unresponsive so I want to be able to run it in R for efficiency purposes.

Link to previously answered question

I'll reiterate and expand on it here though.

Day     Type    Val
1       A       5
1       B       6
1       C       9
1       D       7
2       B       2
2       A       8
2       C       3
2       D       3
3       C       4
3       B       2
3       A       2
4       A       5
4       B       9
4       C       8
5       A       7
5       B       5
6       A       6
6       B       3
6       C       4
7       A       7
7       B       9

I've provided a simplified data frame above (mine has 10's of thousands of rows and hundreds of 'Type'). I want to be able to assign a 'Type' and then

calculate the average 'Val' for the remaining 'Type' where the assigned 'Type' doesn't exist on that day
calculate the average 'Val' for the remaining 'Type' where the assigned 'Type' does exist on that day

For example, if I assign 'Type' C, then C does not exist on days 5 and 7.

The output would then be:

Type     Average_Without    Average_With
A        7.0                5.2
B        7.0                4.4
C        NA                 5.2
D        NA                 5.0

Cath · Accepted Answer

Here is a way to get the values you want, using data.table package (but @DavidArenburg solution is better and will result in your desired output):

library(data.table)

# Average_with:
setDT(df)[Day %in% Day[Type=="C"], mean(Val), by=Type]
#   Type  V1
#1:    A 5.2
#2:    B 4.4
#3:    C 5.6
#4:    D 5.0

# Average_without
df[!Day %in% Day[Type=="C"], mean(Val), by=Type]
#   Type V1
#1:    A  7
#2:    B  7

Edit, to get your output:

setDT(df)[, C:=(Day %in% Day[Type=="C"])]
res <- df[, mean(Val), by=.(Type, C)]
dcast(res, Type~C, value.var="V1")
#   Type FALSE TRUE
#1:    A     7  5.2
#2:    B     7  4.4
#3:    C    NA  5.6
#4:    D    NA  5.0

Or more direct, thanks to @Frank:

setDT(df)[, C := Day %in% df[Type=="C", unique(Day)]]
dcast(df, Type~C, value.var="Val", fun=mean)

Some benchmarks

set.seed(123)
DF <- df <- data.frame(Day=sample(1:1000, 50000, replace=TRUE), 
                 Type=paste0(sample(letters[1:15], 50000, replace=TRUE), sample(letters[1:15], 50000, replace=TRUE)),
                 Val=rnorm(50000), stringsAsFactors=FALSE)
library(microbenchmark)
# the test is made on "mo" (205 days with, 795 days without)
microbenchmark(cath1(df), cath2(df), cath_Frank(df), david1(df), david2(df), GG1(DF), GG2(DF), GG3(DF), GG4(DF), unit="relative")
         # expr        min         lq       mean    median         uq       max neval  cld
      # cath1(df)  1.3533329  1.4811559  1.4137217  1.502278  1.4146092 1.0523168   100 a   
      # cath2(df)  1.0000000  1.0000000  1.0000000  1.000000  1.0000000 1.0000000   100 a   
 # cath_Frank(df)  1.2985873  1.2980500  1.2380200  1.312180  1.2882213 0.9784906   100 a   
     # david1(df)  0.8642457  0.8717645  0.9768656  1.012679  0.9367868 0.9864712   100 a   
     # david2(df)  1.1708477  1.1723941  1.2105785  1.307281  1.2139049 0.9624526   100 a   
        # GG1(DF) 13.8436430 13.7552020 13.0925864 13.727017 15.9302047 3.0718886   100    d
        # GG2(DF)  4.8765172  4.8827384  4.9342907  4.943654  4.5173281 1.6818194   100  b  
        # GG3(DF)  8.8005347  8.8393882  9.6084771  9.048975 11.9310902 4.1580238   100   c 
        # GG4(DF)  4.4787631  4.5812781  4.5098152  4.623952  4.2268167 1.5829500   100  b

Sum IF Multiple Criteria with a Difference - R

Tags:

dataframe

r

Morts81

1 Answers

Cath

Recent Activity

Donate For Us

Sum IF Multiple Criteria with a Difference - R

Tags:

dataframe

r

Morts81

1 Answers

Cath

Related questions

Recent Activity

Donate For Us