Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ddply summarise proportional count

Tags:

r

plyr

I am having some trouble using the ddply function from the plyr package. I am trying to summarise the following data with counts and proportions within each group. Here's my data:

    structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L, 
1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L, 
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"), 
    X5employff = structure(c(2L, 6L, NA, 2L, 4L, 6L, 5L, 2L, 
    2L, 8L, 2L, 2L, 2L, 7L, 7L, 8L, 11L, 7L, 2L, 8L, 8L, 11L, 
    7L, 6L, 2L, 5L, 2L, 8L, 7L, 7L, 7L, 8L, 6L, 7L, 5L, 5L, 7L, 
    2L, 6L, 7L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 2L, 5L, 2L, 2L, 
    2L, 5L, 12L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 5L, 2L, 
    13L, 9L, 9L, 9L, 7L, 8L, 5L), .Label = c("", "1", "1  and 8", 
    "2", "3", "4", "5", "6", "6 and 7", "6 and 7 ", "7", "8", 
    "1 and 8"), class = "factor")), .Names = c("X5employf", "X5employff"
), row.names = c(NA, 73L), class = "data.frame")

And here's my call using ddply:

ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), prop=(n/sum(n))*100)

This gives me the counts of each instance of X5employff correctly, but but seems as though the proportion is being calculated across each row and not within each level of the factor X5employf as follows:

   X5employf X5employff  n prop
1   increase          1 26  100
2   increase          2  1  100
3   increase          3 15  100
4   increase    1 and 8  1  100
5   increase       <NA>  1  100
6   decrease          4  1  100
7   decrease          5  5  100
8   decrease          6  2  100
9   decrease          7  1  100
10  decrease          8  1  100
11      same          4  4  100
12      same          5  6  100
13      same          6  5  100
14      same    6 and 7  3  100
15      same          7  1  100

When manually calculating the proportions within each group I get this:

   X5employf X5employff  n prop
1   increase          1 26  59.09
2   increase          2  1  2.27
3   increase          3 15  34.09
4   increase    1 and 8  1  2.27
5   increase       <NA>  1  2.27
6   decrease          4  1  10.00
7   decrease          5  5  50.00
8   decrease          6  2  20.00
9   decrease          7  1  10.00
10  decrease          8  1  10.00
11      same          4  4  21.05
12      same          5  6  31.57
13      same          6  5  26.31
14      same    6 and 7  3  15.78
15      same          7  1  5.26

As you can see the sum of proportions in each level of factor X5employf equals 100.

I know this is probably ridiculously simple, but I can't seem to get my head around it despite reading all sorts of similar posts. Can anyone help with this and my understanding of how the summarise function works?!

Many, many thanks

Marty

like image 499
marty_c Avatar asked Aug 05 '13 11:08

marty_c


1 Answers

You cannot do it in one ddply call because what gets passed to each summarize call is a subset of your data for a specific combination of your group variables. At this lowest level, you do not have access to that intermediate level sum(n). Instead, do it in two steps:

kano_final <- ddply(kano_final, .(X5employf), transform,
                    sum.n = length(X5employf))

ddply(kano_final, .(X5employf, X5employff), summarise, 
      n = length(X5employff), prop = n / sum.n[1] * 100)

Edit: using a single ddply call and using table as you hinted towards:

ddply(kano_final, .(X5employf), summarise,
      n          = Filter(function(x) x > 0, table(X5employff, useNA = "ifany")),
      prop       = 100* prop.table(n),
      X5employff = names(n))
like image 58
flodel Avatar answered Nov 15 '22 05:11

flodel