I have a sample dataframe "data" as follows:
          X      Y Month Year income
    2281205 228120     3 2011   1000
    2281212 228121     9 2010   1100
    2281213 228121    12 2010    900
    2281214 228121     3 2011   9000
    2281222 228122     6 2010   1111
    2281223 228122     9 2010   3000
    2281224 228122    12 2010   1889
    2281225 228122     3 2011    778
    2281243 228124    12 2010   1111
    2281244 228124     3 2011    200
    2281282 228128     9 2010   7889
    2281283 228128    12 2010   2900
    2281284 228128     3 2011   3400
    2281302 228130     9 2010   1200
    2281303 228130    12 2010   2000
    2281304 228130     3 2011   1900
    2281352 228135     9 2010   2300
    2281353 228135    12 2010   1333
    2281354 228135     3 2011   2340
I want to use ddply to compute the total income for each Y (not X), but only if there are four observations for that Y (for example, Y = 228122, with months 6, 9 and 12 of 2010 and month 3 of 2011). If a Y has fewer than four observations (for example, Y = 228130), I want to simply ignore it. I use the following commands in R for this purpose:
    require(plyr)
    # the data are in the data csv file
    data <- read.csv("data.csv")
    # convert Y (integers) into factors
    data$Y <- as.factor(data$Y)
    # get the count of each unique Y
    count <- ddply(data, .(Y), summarize, freq=length(Y))
    # get the sum of each unique Y
    sum <- ddply(data, .(Y), summarize, tot=sum(income))
    # keep the sum only if Y has more than 3 observations
    colbind <- cbind(count, sum)
    finalsum <- subset(colbind, freq > 3)
My output is as follows:

    > colbind
           Y freq      Y   tot
    1 228120    1 228120  1000
    2 228121    3 228121 11000
    3 228122    4 228122  6778
    4 228124    2 228124  1311
    5 228128    3 228128 14189
    6 228130    3 228130  5100
    7 228135    3 228135  5973
    > finalsum
           Y freq    Y.1  tot
    3 228122    4 228122 6778
The above code works, but it requires many steps. So, I would like to know whether there is a simpler way of performing the above task (using the plyr package).
Now we can use the group_by and summarise_at functions to get the sum by group:

    iris %>%                               # Specify data frame
      group_by(Species) %>%                # Specify group indicator
      summarise_at(vars(Sepal.Length),     # Specify column
                   list(name = sum))       # Specify function
    # A tibble: 3 x 2
    #   Species  name
    #   <fct>   <dbl>
    # 1 setosa   250.
ddply: Split data frame, apply function, and return results in a data frame.
The group_by() function from dplyr (part of the tidyverse) can be used to accomplish this. When working with categorical variables, group_by() divides the data into subgroups based on the variable's distinct values, so that later summaries are computed once per subgroup.
As pointed out in a comment, you can do multiple operations inside a single summarize call. This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:
    x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
    x[x$freq > 3, ]
           Y freq  tot
    3 228122    4 6778
This is also exceptionally easy with the data.table package:

    library(data.table)
    data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
            Y freq  tot
    1: 228122    4 6778
In fact, counting the rows in each group has its own shortcut in data.table, the special symbol .N:

    data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
            Y freq  tot
    1: 228122    4 6778
I think the dplyr package is faster than plyr::ddply and more elegant:

    # read the sample table after copying it to the clipboard
    testData <- read.table(file = "clipboard", header = TRUE)
    require(dplyr)
    testData %>%
      group_by(Y) %>%
      summarise(total = sum(income), freq = n()) %>%
      filter(freq > 3)
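For anyone who wants to check the expected result without installing a package, here is a minimal base R sketch of the same count, sum and filter logic. It is not one of the approaches proposed above, only a cross-check. The data frame is the question's sample typed in by hand (only Y and income are needed), and the names freq, tot and res are just illustrative choices.

```r
# Sample data from the question, typed in by hand (only Y and income are needed).
data <- data.frame(
  Y = c(228120, 228121, 228121, 228121, 228122, 228122, 228122, 228122,
        228124, 228124, 228128, 228128, 228128, 228130, 228130, 228130,
        228135, 228135, 228135),
  income = c(1000, 1100, 900, 9000, 1111, 3000, 1889, 778,
             1111, 200, 7889, 2900, 3400, 1200, 2000, 1900,
             2300, 1333, 2340)
)

# Per-group row count and income total, both keyed by Y.
freq <- tapply(data$income, data$Y, length)
tot  <- tapply(data$income, data$Y, sum)

# Assemble one row per Y, then keep groups with at least four observations.
res <- data.frame(Y = as.numeric(names(freq)),
                  freq = as.vector(freq),
                  tot = as.vector(tot))
res <- res[res$freq > 3, ]
print(res)
```

Only Y = 228122 survives the filter, with freq 4 and tot 6778, which matches the output of the plyr, data.table and dplyr versions.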