R: How to spread, group_by, summarise and mutate at the same time

Tags: r, dplyr, tidyr

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.

CountryName     Days        pCountry     Revenue    Orders  Year
United Kingdom  0-1 days    India        2604.799   13      2014
Norway          8-14 days   Australia    5631.123   9       2015
US              31-45 days  UAE          970.8324   2       2014
United Kingdom  4-7 days    Austria      94.3814    1       2015
Norway          8-14 days   Slovenia     939.8392   3       2014
South Korea     46-60 days  Germany      1959.4199  15      2014
UK              8-14 days   Poland       1394.9096  6.      2015
UK              61-90 days  Lithuania   -170.8035   -1      2015
US              8-14 days   Belize       1687.68    5       2014
Australia       46-60 days  Chile        888.72 2.  0       2014
US              15-30 days  Turkey       2320.7355  8       2014
Australia       0-1 days    Hong Kong    672.1099   2       2015

I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...

data %>% 
  spread(Year, Orders) %>% 
  group_by(CountryName) %>%
  summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>% 
  mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))

The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)

CountryName  percent_inc
UK           34.2
US           28.2
Norway       36.1
...          ...

Edit

I had to make a few edits to the variable names, please note.

asked Jun 21 '17 22:06

RDJ

2 Answers

Sum first, while your data are still in long format, then spread. Here's an example with fake data:

set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE), 
                 Year = sample(2014:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

dat %>% group_by(Country, Year) %>% 
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>% 
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`)/`2014` * 100)

  Country `2014` `2015`        Pct
1       A    575    599  -4.173913
2       B    457    486  -6.345733
3       C    481    319  33.679834
4       D    423    481 -13.711584
5       E    528    551  -4.356061
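Applied to the column names from the question, the same sum-then-spread pattern might look like the sketch below. The seven rows are copied from the sample in the question (only the three columns needed here), and the names orders_sample and result are just placeholders. As in the question's formula, a positive value means Orders fell from 2014 to 2015.

```r
library(dplyr)
library(tidyr)

# Seven rows taken from the sample data in the question
orders_sample <- data.frame(
  CountryName = c("United Kingdom", "Norway", "US", "United Kingdom",
                  "Norway", "US", "US"),
  Orders = c(13, 9, 2, 1, 3, 5, 8),
  Year = c(2014, 2015, 2014, 2015, 2014, 2014, 2014)
)

result <- orders_sample %>%
  group_by(CountryName, Year) %>%                    # one group per country-year
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(Year, sum_orders) %>%                       # years become columns
  mutate(percent_inc = 100 * (`2014` - `2015`) / `2014`)

result
```

US has no 2015 rows in this subset, so its `2015` column and percent_inc come out as NA rather than raising an error.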

If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:

set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE), 
                 Year = sample(2010:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

dat %>% group_by(Country, Year) %>% 
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>% 
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)

   Country  Year sum_orders        Pct
    <fctr> <int>      <int>      <dbl>
 1       A  2010        205         NA
 2       A  2011        144  29.756098
 3       A  2012        226 -56.944444
 4       A  2013        119  47.345133
 5       A  2014        177 -48.739496
 6       A  2015        303 -71.186441
 7       B  2010        146         NA
 8       B  2011        159  -8.904110
 9       B  2012        152   4.402516
10       B  2013        180 -18.421053
# ... with 20 more rows
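When it is time to make the output table, the long result can be reshaped in one step. The sketch below assumes tidyr 1.0.0 or later, where pivot_wider supersedes spread; it reuses the first four sum_orders values shown above for countries A and B, and the names long and wide are placeholders.

```r
library(dplyr)
library(tidyr)

# sum_orders for A and B, 2010-2013, copied from the output above
long <- data.frame(
  Country = rep(c("A", "B"), each = 4),
  Year = rep(2010:2013, times = 2),
  sum_orders = c(205, 144, 226, 119, 146, 159, 152, 180)
) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders)) / lag(sum_orders) * 100) %>%
  ungroup()

# One row per country, one column of Pct per year
wide <- long %>%
  pivot_wider(id_cols = Country, names_from = Year, values_from = Pct)

wide
```

Passing id_cols = Country drops sum_orders from the wide table; without it, sum_orders would be treated as an identifier and each country would be split across several rows.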

answered Sep 21 '22 01:09

eipi10

This isn't a full answer because the question isn't reproducible, but it may help explain the two errors.

Error 1: You're likely getting 'duplicate identifiers for rows' because of spread. spread makes one column for each of the N unique values in the key column, but it relies on the remaining columns to identify the unique row in which to place each value. If the same combination of values in those remaining columns, for instance:

   CountryName      Days        pCountry         Revenue
United Kingdom  0-1 days           India        2604.799
United Kingdom  0-1 days           India        2604.799

shows up twice, then spread can't tell which row each value belongs in. The quick fix is to add a unique row id before spreading: data %>% mutate(row=row_number()) %>% spread...

Error 2: You're likely getting 'sum not meaningful for factors' because of summarise_all. summarise_all operates on every non-grouping column, and some of those columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Instead, sum only the year columns, using backticks because their names start with a digit: summarise(`2014_Sum` = sum(`2014`, na.rm=TRUE), `2015_Sum` = sum(`2015`, na.rm=TRUE)).
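Putting both fixes together, a minimal sketch could look like this. The data frame dup_sample is illustrative: its first row is deliberately repeated to mimic the duplicate shown above, and the other rows follow the question's sample. Column names such as 2014 are not syntactic R names, so they need backticks inside summarise and mutate.

```r
library(dplyr)
library(tidyr)

# Illustrative data: rows 1 and 2 are identical on purpose
dup_sample <- data.frame(
  CountryName = c("United Kingdom", "United Kingdom", "United Kingdom",
                  "Norway", "Norway"),
  Days = c("0-1 days", "0-1 days", "4-7 days", "8-14 days", "8-14 days"),
  Orders = c(13, 13, 1, 9, 3),
  Year = c(2014, 2014, 2015, 2015, 2014)
)

fixed <- dup_sample %>%
  mutate(row = row_number()) %>%   # fix 1: make every row uniquely identifiable
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise(                       # fix 2: sum only the numeric year columns
    `2014_Sum` = sum(`2014`, na.rm = TRUE),
    `2015_Sum` = sum(`2015`, na.rm = TRUE)
  ) %>%
  mutate(percent_inc = 100 * (`2014_Sum` - `2015_Sum`) / `2014_Sum`)

fixed
```

This keeps the spread-first order from the question. Summing in long format first, as in the other answer, avoids the need for the row id altogether.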

answered Sep 19 '22 01:09

CPak
