
Sum by distinct column value in R

I have a very large data frame in R and would like to sum two columns for every distinct value of other columns. For example, say we have a data frame of transactions in various shops over a day, as follows:

shop <- data.frame('shop_id' = c(1, 1, 1, 2, 3, 3), 
  'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'), 
  'city' = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'), 
  'sale' = c(12, 5, 9, 15, 10, 18), 
  'profit' = c(3, 1, 3, 6, 5, 9))

which is:

shop_id  shop_name    city      sale profit
   1     Shop A       London    12   3
   1     Shop A       London    5    1
   1     Shop A       London    9    3
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    10   5
   3     Shop C       Dublin    18   9

I want to sum the sale and profit for each shop to give:

shop_id  shop_name    city      sale profit
   1     Shop A       London    26   7
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    28   14

I am currently using the following code to do this:

library(plyr)
# transform() keeps every row, so duplicates are dropped afterwards
shop_day <- ddply(shop, "shop_id", transform, sale = sum(sale), profit = sum(profit))
shop_day <- subset(shop_day, !duplicated(shop_id))

This works absolutely fine, but as I said my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows which I want to sum), so my code takes ages to run and eventually reports that it has run out of memory.

Does anyone know the most efficient way to do this?

Thanks in advance!

user1165199 asked Aug 02 '12 16:08


1 Answer

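I think the neatest way to do this is with dplyr. Note that summarise_all(sum) sums every non-grouping column, so any column that should not be summed has to be listed in group_by() or dropped first: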

library(dplyr)
shop %>% 
  group_by(shop_id, shop_name, city) %>% 
  summarise_all(sum)
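
If you would rather avoid extra packages, the same grouped sum can be sketched in base R with aggregate(). This is only a minimal sketch on the example data from the question. I have not benchmarked it on a data frame of the size described, and the name shop_day simply reuses the one from the question. The order() call is there so the rows come back in shop_id order regardless of how aggregate() sorts the groups.

```r
# Example data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)

# Sum sale and profit within each shop_id / shop_name / city combination
shop_day <- aggregate(cbind(sale, profit) ~ shop_id + shop_name + city,
                      data = shop, FUN = sum)

# Put the rows in shop_id order and reset the row names
shop_day <- shop_day[order(shop_day$shop_id), ]
rownames(shop_day) <- NULL

print(shop_day)
```

One thing to watch for: the formula interface of aggregate() drops rows with NA in any of the listed columns by default, so check for missing values before relying on the totals.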

user1165199 answered Oct 12 '22 23:10