Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: How to spread, group_by, summarise and mutate at the same time

Tags:

r

dplyr

tidyr

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.

CountryName     Days        pCountry     Revenue    Orders  Year
United Kingdom  0-1 days    India        2604.799   13      2014
Norway          8-14 days   Australia    5631.123   9       2015
US              31-45 days  UAE          970.8324   2       2014
United Kingdom  4-7 days    Austria      94.3814    1       2015
Norway          8-14 days   Slovenia     939.8392   3       2014
South Korea     46-60 days  Germany      1959.4199  15      2014
UK              8-14 days   Poland       1394.9096  6.      2015
UK              61-90 days  Lithuania   -170.8035   -1      2015
US              8-14 days   Belize       1687.68    5       2014
Australia       46-60 days  Chile        888.72 2.  0       2014
US              15-30 days  Turkey       2320.7355  8       2014
Australia       0-1 days    Hong Kong    672.1099   2       2015

I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...

data %>% 
  spread(Year, Orders) %>% 
  group_by(CountryName) %>%
  summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>% 
  mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))

The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)

CountryName  percent_inc
UK           34.2
US           28.2
Norway       36.1
...          ...

Edit

I had to make a few edits to the variable names, please note.

like image 741
RDJ Avatar asked Jun 21 '17 22:06

RDJ


People also ask

Can you group_by two variables in R?

One great feature of the group_by function is its ability to group by more than one variable to show what the aggregated data looks like for combinations of the different variables across the response variable.

How do I apply a function across a column in R?

apply() lets you perform a function across a data frame's rows or columns. In the arguments, you specify what you want as follows: apply(X = data. frame, MARGIN = 1, FUN = function. you.

Is dplyr a Tidyr?

Similarly to readr , dplyr and tidyr are also part of the tidyverse. These packages were loaded in R's memory when we called library(tidyverse) earlier.

What does group by do in dplyr?

group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". ungroup() removes grouping.


2 Answers

Sum first, while your data are still in long format, then spread. Here's an example with fake data:

set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE), 
                 Year = sample(2014:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

dat %>% group_by(Country, Year) %>% 
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>% 
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`)/`2014` * 100)
  Country `2014` `2015`        Pct
1       A    575    599  -4.173913
2       B    457    486  -6.345733
3       C    481    319  33.679834
4       D    423    481 -13.711584
5       E    528    551  -4.356061

If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:

set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE), 
                 Year = sample(2010:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

dat %>% group_by(Country, Year) %>% 
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>% 
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100) 
   Country  Year sum_orders        Pct
    <fctr> <int>      <int>      <dbl>
 1       A  2010        205         NA
 2       A  2011        144  29.756098
 3       A  2012        226 -56.944444
 4       A  2013        119  47.345133
 5       A  2014        177 -48.739496
 6       A  2015        303 -71.186441
 7       B  2010        146         NA
 8       B  2011        159  -8.904110
 9       B  2012        152   4.402516
10       B  2013        180 -18.421053
# ... with 20 more rows
like image 119
eipi10 Avatar answered Sep 21 '22 01:09

eipi10


This is not an answer because you haven't really asked a reproducible question, but just to help out.

Error 1 You're getting this error duplicate identifiers for rows likely because of spread. spread wants to make N columns of your N unique values but it needs to know which unique row to place those values. If you have duplicate value-combinations, for instance:

   CountryName      Days        pCountry         Revenue
United Kingdom  0-1 days           India        2604.799
United Kingdom  0-1 days           India        2604.799

shows up twice, then spread gets confused which row it should place the data in. The quick fix is to data %>% mutate(row=row_number()) %>% spread... before spread.

Error 2 You're getting this error sum not meaningful for factors likely because of summarise_all. summarise_all will operate on all columns but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(2014_Sum = sum(2014), 2015_Sum = sum(2015)).

like image 32
CPak Avatar answered Sep 19 '22 01:09

CPak