Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sum of values greater than or equal too for each element in grouped dataframe (dplyr) R

I have a relatively large dataframe (~2,000,000 rows) where, for each row, I need to, within that observation's group, calculate the sum of every value greater than or equal to the current row's value.

Here is a sample data frame:

sample_df = data.frame(
  group_id = c(1,1,1,1,2,2,2,2),
  value = c(10,12,14,12,8,8,21,10)
)

I currently have a very slow solution to do this using a loop and some filtering, however, having a faster solution would be much preferred. I've been trying to use dplyr however, I can not figure out how I can get the sum of the other observations after the data are grouped.

With the above toy example, here would be the desired output:

desired_output = data.frame(
  group_id = c(1,1,1,1,2,2,2,2),
  value = c(10,12,14,12,8,8,21,10),
  output = c(38,26,0,26,39,39,0,21)
)

Looking around for solutions to this that have already been posted, I haven't seen a clear answer which explains how one can compare each observation in a group to the other observations, filtered on some criteria, in that group. I'd prefer a dplyr-based solution but if there are efficient base-R or data.table solutions than I'd be equally grateful!

like image 230
John K. Avatar asked Mar 22 '18 20:03

John K.


2 Answers

This is a simple non-equi join problem:

library(data.table)
dt = as.data.table(sample_df)

dt[dt, on = .(group_id, value >= value), by = .EACHI,
   .(output = sum(x.value) - i.value)]
#   group_id value output
#1:        1    10     38
#2:        1    12     26
#3:        1    14      0
#4:        1    12     26
#5:        2     8     39
#6:        2     8     39
#7:        2    21      0
#8:        2    10     21
like image 42
eddi Avatar answered Oct 07 '22 16:10

eddi


Using tidyverse. The trick is to use map_dbl to loop on every value.

library(tidyverse)
sample_df %>%
  group_by(group_id) %>%
  mutate(output= map_dbl(value,~sum(value[value>=.x]))-value) %>%
  ungroup

# A tibble: 8 x 3
  group_id value output
     <dbl> <dbl>  <dbl>
1        1    10     38
2        1    12     26
3        1    14      0
4        1    12     26
5        2     8     39
6        2     8     39
7        2    21      0
8        2    10     21

value in the mutate line is your value 'subcolumn' (the group), while .x is its element you're looping on.

a base solution

within(sample_df,output <- unlist(tapply(
  value,group_id,function(x) sapply(x,function(y) sum(x[x>=y])-y))))
#   group_id value output
# 1        1    10     38
# 2        1    12     26
# 3        1    14      0
# 4        1    12     26
# 5        2     8     39
# 6        2     8     39
# 7        2    21      0
# 8        2    10     21
like image 186
Moody_Mudskipper Avatar answered Oct 07 '22 18:10

Moody_Mudskipper