
ddply for sum by group in R

I have a sample dataframe "data" as follows:

    X        Y       Month  Year  income
    2281205  228120  3      2011  1000
    2281212  228121  9      2010  1100
    2281213  228121  12     2010  900
    2281214  228121  3      2011  9000
    2281222  228122  6      2010  1111
    2281223  228122  9      2010  3000
    2281224  228122  12     2010  1889
    2281225  228122  3      2011  778
    2281243  228124  12     2010  1111
    2281244  228124  3      2011  200
    2281282  228128  9      2010  7889
    2281283  228128  12     2010  2900
    2281284  228128  3      2011  3400
    2281302  228130  9      2010  1200
    2281303  228130  12     2010  2000
    2281304  228130  3      2011  1900
    2281352  228135  9      2010  2300
    2281353  228135  12     2010  1333
    2281354  228135  3      2011  2340
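For anyone who wants to reproduce this without the CSV file, the sample above can be rebuilt inline. This is only a sketch: the `text =` block stands in for the contents of `data.csv` (an assumption, since the file itself is not shown), and the checks at the end use base R functions rather than plyr.

```r
# Rebuild the sample data frame inline; this replaces read.csv("data.csv"),
# whose contents are assumed to match the table in the question.
data <- read.table(header = TRUE, text = "
X       Y      Month Year income
2281205 228120 3     2011 1000
2281212 228121 9     2010 1100
2281213 228121 12    2010 900
2281214 228121 3     2011 9000
2281222 228122 6     2010 1111
2281223 228122 9     2010 3000
2281224 228122 12    2010 1889
2281225 228122 3     2011 778
2281243 228124 12    2010 1111
2281244 228124 3     2011 200
2281282 228128 9     2010 7889
2281283 228128 12    2010 2900
2281284 228128 3     2011 3400
2281302 228130 9     2010 1200
2281303 228130 12    2010 2000
2281304 228130 3     2011 1900
2281352 228135 9     2010 2300
2281353 228135 12    2010 1333
2281354 228135 3     2011 2340
")

# Base R sanity checks (not plyr): observations per Y and income total per Y.
obs_per_y    <- table(data$Y)
income_per_y <- tapply(data$income, data$Y, sum)

print(obs_per_y)
print(income_per_y)
```

Only Y = 228122 has four observations here, so it is the only group that should survive the filter.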

I want to use ddply to compute the total income for each Y (not X), but only when there are four observations for that Y (for example, Y = 228122 has months 6, 9 and 12 of 2010 and month 3 of 2011). If a Y has fewer than four observations (for example, Y = 228130), I simply want to ignore it. I use the following commands in R for this purpose:

    require(plyr)
    # the data are in the data.csv file
    data <- read.csv("data.csv")
    # convert Y (integers) into a factor
    data$Y <- as.factor(data$Y)
    # get the count of each unique Y
    count <- ddply(data, .(Y), summarize, freq = length(Y))
    # get the sum of income for each unique Y
    sum <- ddply(data, .(Y), summarize, tot = sum(income))
    # combine both, then keep only the Y values with at least 4 observations
    colbind <- cbind(count, sum)
    finalsum <- subset(colbind, freq > 3)

My output is as follows:

    > colbind
           Y freq      Y   tot
    1 228120    1 228120  1000
    2 228121    3 228121 11000
    3 228122    4 228122  6778
    4 228124    2 228124  1311
    5 228128    3 228128 14189
    6 228130    3 228130  5100
    7 228135    3 228135  5973
    > finalsum
           Y freq    Y.1  tot
    3 228122    4 228122 6778

The above code works, but it requires many steps. So, I would like to know whether there is a simpler way of performing this task (using the plyr package).

Metrics asked Dec 26 '12 04:12



2 Answers

As pointed out in a comment, you can do multiple operations inside a single summarize call.

This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:

    x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
    x[x$freq > 3, ]

           Y freq  tot
    3 228122    4 6778

This is also exceptionally easy with the data.table package:

    library(data.table)
    data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]

            Y freq  tot
    1: 228122    4 6778

In fact, data.table has its own shortcut for counting the rows in each group, so you can use .N instead of length():

    data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]

            Y freq  tot
    1: 228122    4 6778

Andrie answered Oct 27 '22 10:10


I think the dplyr package is faster than plyr::ddply and more elegant:

    testData <- read.table(file = "clipboard", header = TRUE)
    require(dplyr)
    testData %>%
      group_by(Y) %>%
      summarise(total = sum(income), freq = n()) %>%
      filter(freq > 3)

HatMatrix answered Oct 27 '22 10:10