ddply for sum by group in R

I have a sample dataframe "data" as follows:

          X      Y Month Year income
    2281205 228120     3 2011   1000
    2281212 228121     9 2010   1100
    2281213 228121    12 2010    900
    2281214 228121     3 2011   9000
    2281222 228122     6 2010   1111
    2281223 228122     9 2010   3000
    2281224 228122    12 2010   1889
    2281225 228122     3 2011    778
    2281243 228124    12 2010   1111
    2281244 228124     3 2011    200
    2281282 228128     9 2010   7889
    2281283 228128    12 2010   2900
    2281284 228128     3 2011   3400
    2281302 228130     9 2010   1200
    2281303 228130    12 2010   2000
    2281304 228130     3 2011   1900
    2281352 228135     9 2010   2300
    2281353 228135    12 2010   1333
    2281354 228135     3 2011   2340

I want to use ddply to compute the total income for each Y (not X), but only when there are four observations for that Y (for example Y = 228122, with months 6, 9 and 12 of 2010 and month 3 of 2011). If a Y has fewer than four observations (for example Y = 228130), I want to simply ignore it. I use the following commands in R for this purpose:

    require(plyr)
    # the data are in the data csv file
    data <- read.csv("data.csv")
    # convert Y (integers) into factors
    data$Y <- as.factor(data$Y)
    # get the count of each unique Y
    count <- ddply(data, .(Y), summarize, freq = length(Y))
    # get the sum of each unique Y
    sum <- ddply(data, .(Y), summarize, tot = sum(income))
    # show the sum if the number of observations for each Y is more than 3
    colbind <- cbind(count, sum)
    finalsum <- subset(colbind, freq > 3)

My output is as follows:

    > colbind
           Y freq      Y   tot
    1 228120    1 228120  1000
    2 228121    3 228121 11000
    3 228122    4 228122  6778
    4 228124    2 228124  1311
    5 228128    3 228128 14189
    6 228130    3 228130  5100
    7 228135    3 228135  5973
    > finalsum
           Y freq    Y.1  tot
    3 228122    4 228122 6778

The above code works, but it requires many steps. So I would like to know whether there is a simpler way of performing this task (using the plyr package).

asked Dec 26 '12 04:12

2 Answers

As pointed out in a comment, you can do multiple operations inside a single summarize call.

This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:

    x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
    x[x$freq > 3, ]

           Y freq  tot
    3 228122    4 6778
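If you would rather do the filtering inside the ddply() call itself, one possible sketch is to pass an anonymous function that returns NULL for small groups. To my knowledge ddply() drops groups whose result is NULL, so only the Y values with at least four rows remain. The data frame below is rebuilt inline from the question's table (only the Y and income columns, which is all this step needs), and the name result is just an illustrative choice:

```r
library(plyr)

# Sample data from the question (only Y and income are needed here)
data <- data.frame(
  Y = c(228120, 228121, 228121, 228121, 228122, 228122, 228122, 228122,
        228124, 228124, 228128, 228128, 228128, 228130, 228130, 228130,
        228135, 228135, 228135),
  income = c(1000, 1100, 900, 9000, 1111, 3000, 1889, 778,
             1111, 200, 7889, 2900, 3400, 1200, 2000, 1900,
             2300, 1333, 2340)
)

# One-row summary for groups with at least four rows, NULL otherwise;
# groups that return NULL do not appear in the combined result
result <- ddply(data, .(Y), function(d) {
  if (nrow(d) < 4) return(NULL)
  data.frame(freq = nrow(d), tot = sum(d$income))
})

print(result)
```

This is not necessarily faster than summarizing first and subsetting afterwards; it only keeps the count, the sum and the filter in a single call.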

This is also exceptionally easy with the data.table package:

    library(data.table)
    data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]

            Y freq  tot
    1: 228122    4 6778

In fact, data.table has its own shortcut for the number of rows in each group, .N:

    data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]

            Y freq  tot
    1: 228122    4 6778

answered Oct 27 '22 10:10

Andrie

I think the dplyr package is faster than plyr::ddply and more elegant.

    testData <- read.table(file = "clipboard", header = TRUE)
    require(dplyr)
    testData %>%
      group_by(Y) %>%
      summarise(total = sum(income), freq = n()) %>%
      filter(freq > 3)
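Reading from "clipboard" depends on the platform and on what happens to be copied at the time, so here is a self-contained sketch of the same pipeline that should run anywhere. It assumes the question's table is pasted inline as whitespace-separated text and read through the text argument of read.table(); the name result is only for illustration:

```r
library(dplyr)

# Inline copy of the question's table (assumption: whitespace-separated, with a header row)
testData <- read.table(header = TRUE, text = "
X Y Month Year income
2281205 228120 3 2011 1000
2281212 228121 9 2010 1100
2281213 228121 12 2010 900
2281214 228121 3 2011 9000
2281222 228122 6 2010 1111
2281223 228122 9 2010 3000
2281224 228122 12 2010 1889
2281225 228122 3 2011 778
2281243 228124 12 2010 1111
2281244 228124 3 2011 200
2281282 228128 9 2010 7889
2281283 228128 12 2010 2900
2281284 228128 3 2011 3400
2281302 228130 9 2010 1200
2281303 228130 12 2010 2000
2281304 228130 3 2011 1900
2281352 228135 9 2010 2300
2281353 228135 12 2010 1333
2281354 228135 3 2011 2340
")

# Sum and count per Y, then keep only the groups with more than three rows
result <- testData %>%
  group_by(Y) %>%
  summarise(total = sum(income), freq = n()) %>%
  filter(freq > 3)

print(result)
```

With the sample data, only Y = 228122 has four observations, so that is the only group left after the filter.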

answered Oct 27 '22 10:10

HatMatrix
