
How to calculate percentage for each cell in a dataframe using ddply?

Tags: r, plyr

My guess is that this is easy using ddply, but I'm still a newbie at R and can't get my head around it.

I have a data.frame that looks like this:

txt <- "label var1 var2 var3 var4 var5 var6 var7
lab1 401 80 57 125 118 182 83
lab2 72 192 80 224 182 187 178
lab3 7 152 134 104 105 80 130
lab4 3 58 210 30 78 33 87
lab5 1 2 3 1 1 2 6"

mydata <- read.table(textConnection(txt), sep = " ", header = TRUE)

Doing this, I can transform one variable at a time into percentages:

mydata$var1 <- round(prop.table(mydata$var1),3)*100

But how can I do it for all variables (var1:var7) in the data.frame in one stroke?

NOTE: This is going into a function in which the length and number of variables differ from time to time, so the code should be sensitive to this.

Thank you in advance

Einnor asked Jun 05 '13 22:06


3 Answers

Just coerce to a matrix and use the margin argument to prop.table like so:

round( prop.table(as.matrix(df),2) * 100 , 3 )

For example:

set.seed(123)
df <- data.frame( matrix( sample(4 , 12 , replace=TRUE ) , 3 ) )
df
#  X1 X2 X3 X4
#1  2  4  3  2
#2  4  4  4  4
#3  2  1  3  2
round( prop.table(as.matrix(df),2) * 100 , 3 )
#    X1     X2 X3 X4
#[1,] 25 44.444 30 25
#[2,] 50 44.444 40 50
#[3,] 25 11.111 30 25

In your example, it looks like what I took to be row names is actually a column of character values. To use prop.table on all columns except the first, you can do prop.table( df[,-1] , margin = 2 ).
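
Applied to the data from the question, that could look like the sketch below. It rebuilds mydata so it runs on its own; the names pct and result are only illustrative, and coercing with as.matrix first follows the advice at the top of this answer.

```r
# Rebuild the question's data so the example is self-contained
txt <- "label var1 var2 var3 var4 var5 var6 var7
lab1 401 80 57 125 118 182 83
lab2 72 192 80 224 182 187 178
lab3 7 152 134 104 105 80 130
lab4 3 58 210 30 78 33 87
lab5 1 2 3 1 1 2 6"
mydata <- read.table(text = txt, header = TRUE)

# Column-wise percentages for everything except the label column
pct <- round(prop.table(as.matrix(mydata[, -1]), margin = 2) * 100, 3)

# Put the labels back in front of the percentage columns
result <- cbind(mydata["label"], as.data.frame(pct))
result
```

Because the label column is dropped by position (-1), this works for any number of variables as long as the labels stay in the first column.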

Simon O'Hanlon answered Oct 12 '22 23:10


No need for fancy packages. This will work as long as you want to do it to all but the first column. You could adapt the conditions for what columns are included if 2:ncol isn't appropriate.

t(round(t(mydata[, 2:ncol(mydata)]) / colSums(mydata[, 2:ncol(mydata)]) * 100, 3))
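
The double transpose is needed because R recycles the colSums vector down the columns, so the division has to happen while the variables are in rows. As a sketch of the same calculation without transposing, sweep() takes the margin explicitly. The code below rebuilds mydata from the question; num, pct and pct_t are illustrative names.

```r
# Rebuild the question's data so the example is self-contained
txt <- "label var1 var2 var3 var4 var5 var6 var7
lab1 401 80 57 125 118 182 83
lab2 72 192 80 224 182 187 178
lab3 7 152 134 104 105 80 130
lab4 3 58 210 30 78 33 87
lab5 1 2 3 1 1 2 6"
mydata <- read.table(text = txt, header = TRUE)

# Numeric part of the data as a matrix (all but the first column)
num <- as.matrix(mydata[, 2:ncol(mydata)])

# Divide each column (margin 2) by its own column sum, then scale to percent
pct <- round(sweep(num, 2, colSums(num), "/") * 100, 3)

# The double-transpose version, for comparison
pct_t <- t(round(t(num) / colSums(num) * 100, 3))
all.equal(pct, pct_t)
```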

And, since you asked about plyr, and dplyr is the successor to plyr, here's how you'd do it with that:

require(dplyr)
require(reshape2)

mydata %>% melt(id.vars = "label") %>%
    group_by(variable) %>%
    mutate(prop = round(value / sum(value) * 100, 3)) %>%
    dplyr::select(-value) %>%
    dcast(label ~ variable, fun.aggregate = sum, value.var = "prop")

Convert your data to long format, calculate the proportions, and switch it back to wide. A lot of typing for what Simon O'Hanlon shows to be a quick one-liner, but the dplyr method generalizes nicely to whatever sorts of calculations you might want to do.
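
With dplyr 1.0 or later, the reshaping step can probably be skipped. The sketch below assumes a recent dplyr is installed and uses across() with where(is.numeric), so it picks up however many numeric columns the data frame happens to have. It rebuilds mydata from the question to stay self-contained.

```r
library(dplyr)

# Rebuild the question's data so the example is self-contained
txt <- "label var1 var2 var3 var4 var5 var6 var7
lab1 401 80 57 125 118 182 83
lab2 72 192 80 224 182 187 178
lab3 7 152 134 104 105 80 130
lab4 3 58 210 30 78 33 87
lab5 1 2 3 1 1 2 6"
mydata <- read.table(text = txt, header = TRUE)

# Replace every numeric column by its column-wise percentage
result <- mydata %>%
    mutate(across(where(is.numeric), ~ round(.x / sum(.x) * 100, 3)))
result
```

The label column is left untouched because it is not numeric, so no id column has to be named explicitly.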

Gregor Thomas answered Oct 12 '22 23:10


Maybe something like this can help you:

cbind(label=mydata[,1],as.data.frame(apply(mydata[,-1], 2, function(col) round(prop.table(col),3)*100 )))
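
Since the question mentions putting this into a function for data frames of varying shape, here is one way that might be sketched. to_percent is a hypothetical name, and the helper assumes every numeric column should be converted while all other columns are left alone. Note that it rounds the percentage itself to `digits` places, which differs slightly from rounding the proportion before multiplying by 100.

```r
# Hypothetical helper: convert every numeric column to column-wise percentages
to_percent <- function(df, digits = 3) {
    # Find the numeric columns, wherever they are
    is_num <- vapply(df, is.numeric, logical(1))
    # Replace each numeric column by its share of the column total, in percent
    df[is_num] <- lapply(df[is_num], function(col) round(prop.table(col) * 100, digits))
    df
}

# Rebuild the question's data so the example is self-contained
txt <- "label var1 var2 var3 var4 var5 var6 var7
lab1 401 80 57 125 118 182 83
lab2 72 192 80 224 182 187 178
lab3 7 152 134 104 105 80 130
lab4 3 58 210 30 78 33 87
lab5 1 2 3 1 1 2 6"
mydata <- read.table(text = txt, header = TRUE)

to_percent(mydata)
```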

storaged answered Oct 13 '22 01:10