Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Summarising levels of multiple factor variables

Tags:

dataframe

r

dplyr

I have searched for similar questions, but cannot find the exact solution required. This question is somewhat similar, but only deals with the issue of summarising multiple continuous variables, not factors.

I have a dataframe consisting of 4 factor variables (sex, agegroup, hiv, group), e.g.

set.seed(20150710)
df<-data.frame(sex=as.factor(c(sample(1:2, 10000, replace=T))), 
           agegroup=as.factor(c(sample(1:5,10000, replace=T))),
           hiv=as.factor(c(sample(1:3,10000, replace=T))),
           group=as.factor(c(sample(1:2,10000, replace=T)))
           )

levels(df$sex)<- c("Male", "Female")
levels(df$agegroup)<- c("16-24", "25-34", "35-44", "45-54", "55+")
levels(df$hiv)<-c("Positive", "Negative", "Not tested")
levels(df$group)<-c("Intervention", "Control")

I would like to create a summary table giving counts and proportions for each level of the exposure variables sex, agegroup and hiv, stratified by group.

EDIT: this is what I am aiming for:

                X N_Control Percent_Control N_Intervention     Percent_Intervention
1      sex_Female      2517       0.5041057           2480            0.4953066
2        sex_Male      2476       0.4958943           2527            0.5046934
3  agegroup_16-24      1005       0.2012818            992            0.1981226
4  agegroup_25-34      1001       0.2004807            996            0.1989215
5  agegroup_35-44      1010       0.2022832            997            0.1991212
6  agegroup_45-54       976       0.1954737            996            0.1989215
7    agegroup_55+      1001       0.2004807           1026            0.2049131
8    hiv_Negative      1679       0.3362708           1642            0.3279409
9  hiv_Not tested      1633       0.3270579           1660            0.3315359
10   hiv_Positive      1681       0.3366713           1705            0.3405233

But I cannot get it to work with summarise_each in dplyr; only overall variable counts and proportions, and not for each factor level, are given:

df.out<-df %>%
  group_by(group) %>%
  summarise_each(funs(N=n(), Percent=n()/sum(n())), sex, agegroup, hiv)
print(df.out)

group sex_N agegroup_N hiv_N sex_Percent agegroup_Percent hiv_Percent
1     1  4973       4973  4973           1                1           1
2     2  5027       5027  5027           1                1           1

Finally, is there some way to reshape the table (e.g. using tidyr), so that the exposure variables (sex, agegroup, hiv) are reported as rows?

Thanks

like image 944
Peter MacPherson Avatar asked Feb 10 '23 13:02

Peter MacPherson


1 Answers

Doing it in two steps will give you the desired result. First, calculate the n, then calculate the percentage by group:

library(dplyr)
df.out <- df %>%
  group_by(group, sex, agegroup, hiv) %>%
  tally() %>%
  group_by(group) %>%
  mutate(percent=n/sum(n))

A solution with data.table:

library(data.table)
dt.out <- setDT(df)[, .N, by=.(group, sex, agegroup, hiv)][, percent:=N/sum(N), by=group]

library(microbenchmark)
microbenchmark(df.out = df %>%
                 group_by(group, sex, agegroup, hiv) %>%
                 tally() %>%
                 group_by(group) %>%
                 mutate(percent=n/sum(n)),
               dt.out = df[,.N,by=.(group, sex, agegroup, hiv)][,percent:=N/sum(N),by=group])

# Unit: milliseconds
#   expr      min       lq     mean   median       uq       max neval cld
# df.out 8.299870 8.518590 8.894504 8.708315 8.931459 11.964930   100   b
# dt.out 2.346632 2.394788 2.540132 2.441777 2.551235  4.344442   100  a 

Conclusion: the data.table solution is much faster (3.5x).


To get a table like you requested after the edit of your question, you can do the following:

library(data.table)

setDT(df)
dt.sex <- dcast(df[,.N, by=.(sex,group)][,percent:=N/sum(N)], sex ~ group, value.var = c("N", "percent"))
dt.age <- dcast(df[,.N, by=.(agegroup,group)][,percent:=N/sum(N)], agegroup ~ group, value.var = c("N", "percent"))
dt.hiv <- dcast(df[,.N, by=.(hiv,group)][,percent:=N/sum(N)], hiv ~ group, value.var = c("N", "percent"))

dt.out.wide <- rbindlist(list(dt.sex, dt.age, dt.hiv), use.names=FALSE)
names(dt.out.wide) <- c("X","N_Intervention","N_Control","percent_Intervention","percent_Control")

this gives:

> dt.out.wide
             X N_Intervention N_Control percent_Intervention percent_Control
 1:       Male           2454      2488               0.2454          0.2488
 2:     Female           2561      2497               0.2561          0.2497
 3:      16-24            954       991               0.0954          0.0991
 4:      25-34           1033      1002               0.1033          0.1002
 5:      35-44           1051      1000               0.1051          0.1000
 6:      45-54            983       978               0.0983          0.0978
 7:        55+            994      1014               0.0994          0.1014
 8:   Positive           1717      1664               0.1717          0.1664
 9:   Negative           1637      1659               0.1637          0.1659
10: Not tested           1661      1662               0.1661          0.1662
like image 162
Jaap Avatar answered Feb 12 '23 05:02

Jaap