Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data.table: subgroup weighted percent of group

I have a data.table like:

library(data.table)
widgets <- data.table(serial_no=1:100, 
                      color=rep_len(c("red","green","blue","black"),length.out=100),
                      style=rep_len(c("round","pointy","flat"),length.out=100),
                      weight=rep_len(1:5,length.out=100) )

Although I am not sure this is the most data.table way, I can calculate subgroup frequency by group using table and length in a single step-- for example, to answer the question "What percent of red widgets are round?"

edit: this code does not provide the right answer

# example A
widgets[, list(style = unique(style), 
               style_pct_of_color_by_count = 
                 as.numeric(table(style)/length(style)) ), by=color]

#    color  style style_pct_of_color_by_count
# 1:   red  round                        0.32
# 2:   red pointy                        0.32
# 3:   red   flat                        0.36
# 4: green pointy                        0.32
# ...

But I can't use that approach to answer questions like "By weight, what percent of red widgets are round?" I can only come up with a two-step approach:

# example B
widgets[,list(cs_weight=sum(weight)),by=list(color,style)][,list(style, style_pct_of_color_by_weight=cs_weight/sum(cs_weight)),by=color]

#    color  style style_pct_of_color_by_weight
# 1:   red  round                    0.3466667
# 2:   red pointy                    0.3466667
# 3:   red   flat                    0.3066667
# 4: green pointy                    0.3333333
# ...

I'm looking for a single-step approach to B, and A if improvable, in an explanation that deepens my understanding of data.table syntax for by-group operations. Please note that this question is different from Weighted sum of variables by groups with data.table because mine involves subgroups and avoiding multiple steps. TYVM.

like image 986
C8H10N4O2 Avatar asked Jun 19 '15 17:06

C8H10N4O2


1 Answers

This is almost a single step:

# A
widgets[,{
    totwt = .N
    .SD[,.(frac=.N/totwt),by=style]
},by=color]
    # color  style frac
 # 1:   red  round 0.36
 # 2:   red pointy 0.32
 # 3:   red   flat 0.32
 # 4: green pointy 0.36
 # 5: green   flat 0.32
 # 6: green  round 0.32
 # 7:  blue   flat 0.36
 # 8:  blue  round 0.32
 # 9:  blue pointy 0.32
# 10: black  round 0.36
# 11: black pointy 0.32
# 12: black   flat 0.32

# B
widgets[,{
    totwt = sum(weight)
    .SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
 #    color  style      frac
 # 1:   red  round 0.3466667
 # 2:   red pointy 0.3466667
 # 3:   red   flat 0.3066667
 # 4: green pointy 0.3333333
 # 5: green   flat 0.3200000
 # 6: green  round 0.3466667
 # 7:  blue   flat 0.3866667
 # 8:  blue  round 0.2933333
 # 9:  blue pointy 0.3200000
# 10: black  round 0.3733333
# 11: black pointy 0.3333333
# 12: black   flat 0.2933333

How it works: Construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.


Alternatives. If styles repeat within each color and this is only for display purposes, try a table:

# A
widgets[,
  prop.table(table(color,style),1)
]
#        style
# color   flat pointy round
#   black 0.32   0.32  0.36
#   blue  0.36   0.32  0.32
#   green 0.32   0.36  0.32
#   red   0.32   0.32  0.36

# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
  prop.table(table(color,style),1)
]

#        style
# color        flat    pointy     round
#   black 0.2933333 0.3333333 0.3733333
#   blue  0.3866667 0.3200000 0.2933333
#   green 0.3200000 0.3333333 0.3466667
#   red   0.3066667 0.3466667 0.3466667

For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea (since it costs so much memory). Also, weight has to be an integer; otherwise, its sum will be silently truncated to one (e.g., try rep(1,2.5) # [1] 1 1).

like image 76
Frank Avatar answered Nov 01 '22 05:11

Frank