Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr: Counts/Percentages of factor grouped by school not getting grouped

Tags:

r

dplyr

I have a long dataset with one row per individual grouped with schools. Each row has an ordered factor {1, 2, 3, 4}, "cats". I want to get the percentage of 1's, 2's, 3's and 4's within each school. The dataset looks like this:

  school_number           cats
1          10505             3
2          10505             3
3          10502             1
4          10502             1
5          10502             2
6          10502             1
7          10502             1
8          10502             2
10         10503             3
11         10505             2

I tried something like this:

df_pcts <- df %>%
   group_by(school_number) %>%
   mutate(total=sum(table(cats))) %>%
   summarize(cat_pct = table(cats)/total)

but the total variable produced by the mutate() step puts the grand total number of rows in every row. I can't even get to the final summarize step. I'm confused.

P.S. In some other posts I saw lines like this:

n = n()

when I do that I get a message saying,

Error in n() : This function should not be called directly

Where did this come from?

TIA

like image 341
Stuart Avatar asked Sep 17 '14 02:09

Stuart


1 Answers

Perhaps this helps a little, though I'm not 100% sure of what output you need.

This counts the number of rows of each combination of school_number/cats that exist in your df using tally. Then calculates the percentage of 'cats' in each school_number by then only grouping by school_number.

df %>%
  group_by(school_number,cats) %>%
  tally  %>%
  group_by(school_number) %>%
  mutate(pct=(100*n)/sum(n))

It gives this:

  #    school_number cats n       pct
  #  1         10502    1 4  66.66667
  #  2         10502    2 2  33.33333
  #  3         10503    3 1 100.00000
  #  4         10505    2 1  33.33333
  #  5         10505    3 2  66.66667

EDIT:

to add in rows with 0% that are missing from your sample data, you could do the following. Bind together the output above with a df that contains 0% for all school_number/cats combinations. Only keep the first instance of this bind (first instances always containing values >0% should they exist). I then arranged it by school_number and cats for ease of reading:

y<-df %>%
  group_by(school_number,cats) %>%
  tally  %>%
  group_by(school_number) %>%
  mutate(pct=(100*n)/sum(n)) %>%
  select(-n) 

x<-data.frame(school_number=rep(unique(df$school_number),each=4), cats=1:4,pct=0)  

rbind(y,x) %>%
  group_by(school_number,cats)%>%
  filter(row_number() == 1) %>%
  arrange(school_number,cats)

which gives:

#   school_number cats       pct
#1          10502    1  66.66667
#2          10502    2  33.33333
#3          10502    3   0.00000
#4          10502    4   0.00000
#5          10503    1   0.00000
#6          10503    2   0.00000
#7          10503    3 100.00000
#8          10503    4   0.00000
#9          10505    1   0.00000
#10         10505    2  33.33333
#11         10505    3  66.66667
#12         10505    4   0.00000
like image 88
jalapic Avatar answered Sep 21 '22 11:09

jalapic