I have a very large data frame in R and would like to sum two columns for every distinct combination of values in the other columns. For example, say we had a data frame of transactions in various shops over a day, as follows:
shop <- data.frame('shop_id' = c(1, 1, 1, 2, 3, 3),
                   'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
                   'city' = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'),
                   'sale' = c(12, 5, 9, 15, 10, 18),
                   'profit' = c(3, 1, 3, 6, 5, 9))
which is:
shop_id  shop_name  city     sale  profit
1        Shop A     London   12    3
1        Shop A     London   5     1
1        Shop A     London   9     3
2        Shop B     Cardiff  15    6
3        Shop C     Dublin   10    5
3        Shop C     Dublin   18    9
I want to sum the sale and profit for each shop to give:
shop_id  shop_name  city     sale  profit
1        Shop A     London   26    7
2        Shop B     Cardiff  15    6
3        Shop C     Dublin   28    14
I am currently using the following code to do this:
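library(plyr)
# Replace sale and profit with the per-shop totals on every row
shop_day <- ddply(shop, "shop_id", transform, sale = sum(sale), profit = sum(profit))
# Keep only the first row for each shop
shop_day <- subset(shop_day, !duplicated(shop_id))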
This works absolutely fine, but as I said, my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows which I want to sum). My code takes ages to run and then eventually says it has run out of memory.
Does anyone know the most efficient way to do this?
Thanks in advance!
I think the neatest way to do this is with dplyr:
library(dplyr)
shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_all(sum)
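Note that summarise_all(sum) sums every column that is not a grouping variable, which is fine for the five-column example but may not be what you want on a 37-column data frame. A minimal sketch of a variant that sums only sale and profit, assuming dplyr 1.0.0 or later (for across() and the .groups argument) and assuming the three grouping columns below are the ones that identify a shop in your real data:

```r
library(dplyr)

# Example data from the question
shop <- data.frame(shop_id   = c(1, 1, 1, 2, 3, 3),
                   shop_name = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'),
                   city      = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'),
                   sale      = c(12, 5, 9, 15, 10, 18),
                   profit    = c(3, 1, 3, 6, 5, 9))

shop_day <- shop %>%
  # One group per distinct shop (assumed identifying columns)
  group_by(shop_id, shop_name, city) %>%
  # Sum only the two numeric columns of interest; drop the grouping afterwards
  summarise(across(c(sale, profit), sum), .groups = "drop")

print(shop_day)
```

Any column that is neither grouped nor summarised is dropped from the result, so add the ones you need to keep to group_by(), provided they are constant within each shop.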