dplyr Use pivot_longer and pivot_wider on subset of variable

Tags:

r

dplyr

Is there a way to use pivot_longer and pivot_wider on a subset of a variable? Here's an example. First, I'll create a data frame with the desired starting structure.

library(tidyverse)

# Assume this as starting df
arrests <- USArrests %>% 
  as_tibble(rownames = "State") %>% 
  pivot_longer(-State, names_to = "Crime", values_to = "Value") %>% 
  group_by(State) %>% 
  mutate(Total = sum(Value)) %>% 
  ungroup()

arrests
# A tibble: 200 x 4
   State   Crime    Value Total
   <chr>   <chr>    <dbl> <dbl>
 1 Alabama Murder    13.2  328.
 2 Alabama Assault  236    328.
 3 Alabama UrbanPop  58    328.
 4 Alabama Rape      21.2  328.
 5 Alaska  Murder    10    366.
 6 Alaska  Assault  263    366.
 7 Alaska  UrbanPop  48    366.
 8 Alaska  Rape      44.5  366.
 9 Arizona Murder     8.1  413.
10 Arizona Assault  294    413.
# ... with 190 more rows

So we are using the arrests data frame. Now I would like to fold "Total" into "Crime" so that "Total" is a value within Crime, just like "Murder".

I would also like to do the reverse. After "Total" is folded into "Crime", I want to use pivot_wider on "Crime" but only on values where Crime == "Total".

Are these actions possible?

hmhensen asked Jan 25 '23 05:01


2 Answers

One option is add_row from tibble. Split the data by 'State' with group_split, loop over the resulting list with map_dfr, add a 'Total' row to each piece using the first value of its 'Total' column, and then drop the 'Total' column.

library(dplyr)
library(purrr)
library(tibble)
arrests2 <- arrests %>%
  group_split(State) %>%
  # for each State: append one 'Total' row, then drop the 'Total' column
  map_dfr(~ .x %>%
    add_row(State = .$State[1], Crime = 'Total',
            Value = .$Total[1]) %>%
    select(-Total))
arrests2
# A tibble: 250 x 3
#  State   Crime    Value
# * <chr>   <chr>    <dbl>
# 1 Alabama Murder    13.2
# 2 Alabama Assault  236  
# 3 Alabama UrbanPop  58  
# 4 Alabama Rape      21.2
# 5 Alabama Total    328. 
# 6 Alaska  Murder    10  
# 7 Alaska  Assault  263  
# 8 Alaska  UrbanPop  48  
# 9 Alaska  Rape      44.5
#10 Alaska  Total    366. 
# … with 240 more rows
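The same split/add-row idea can also be sketched without purrr by using group_modify. This is only a sketch, assuming a dplyr version that provides group_modify (0.8.1 or later); the name arrests2_alt is made up for illustration and is not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Starting table, built the same way as in the question
arrests <- USArrests %>%
  as_tibble(rownames = "State") %>%
  pivot_longer(-State, names_to = "Crime", values_to = "Value") %>%
  group_by(State) %>%
  mutate(Total = sum(Value)) %>%
  ungroup()

# Inside group_modify, .x is the current group's rows without the State column.
# add_row leaves Total as NA on the new row, which is fine because the
# column is dropped afterwards.
arrests2_alt <- arrests %>%
  group_by(State) %>%
  group_modify(~ add_row(.x, Crime = "Total", Value = .x$Total[1])) %>%
  ungroup() %>%
  select(-Total)

arrests2_alt
```

The result should have the same shape as arrests2: the four original rows per State followed by one 'Total' row.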

Another option is to summarise each 'State' down to its 'Total' value and then bind_rows that summary onto the original data (minus the 'Total' column)

arrests %>% 
   group_by(State) %>% 
   summarise(Crime = 'Total', Value = first(Total)) %>% 
   bind_rows(arrests %>% select(-Total), .)  %>% 
   arrange(State)

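Or use pivot_longer to stack 'Value' and 'Total' into one column, relabel the stacked 'Total' rows in 'Crime', and drop the repeated totals with distinct. Note that the value column comes out in lower case as 'value'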

library(tidyr)
arrests %>%
    pivot_longer(cols = Value:Total) %>% 
    mutate(Crime = replace(Crime, name == 'Total', 'Total')) %>% 
    select(-name) %>%
    distinct()
# A tibble: 250 x 3
#   State   Crime    value
#   <chr>   <chr>    <dbl>
# 1 Alabama Murder    13.2
# 2 Alabama Total    328. 
# 3 Alabama Assault  236  
# 4 Alabama UrbanPop  58  
# 5 Alabama Rape      21.2
# 6 Alaska  Murder    10  
# 7 Alaska  Total    366. 
# 8 Alaska  Assault  263  
# 9 Alaska  UrbanPop  48  
#10 Alaska  Rape      44.5
# … with 240 more rows
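To see why the distinct() step is needed in the pivot_longer approach, it may help to count rows at each stage. The following is a small sketch; the intermediate names stacked and folded are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Starting table, built the same way as in the question
arrests <- USArrests %>%
  as_tibble(rownames = "State") %>%
  pivot_longer(-State, names_to = "Crime", values_to = "Value") %>%
  group_by(State) %>%
  mutate(Total = sum(Value)) %>%
  ungroup()

# Stacking Value and Total doubles the row count: every original row
# contributes one 'Value' row and one 'Total' row.
stacked <- arrests %>% pivot_longer(cols = Value:Total)
nrow(stacked)                     # 400

# After relabelling, each State carries four identical 'Total' rows.
folded <- stacked %>%
  mutate(Crime = replace(Crime, name == "Total", "Total")) %>%
  select(-name)
sum(folded$Crime == "Total")      # 200

# distinct() collapses those repeats to one 'Total' row per State.
nrow(distinct(folded))            # 250
```

One caveat with this design: distinct() would also collapse any genuinely duplicated State/Crime/value rows, so it relies on each State and Crime pair appearing once in the input.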

To do the reverse, group by 'State', create the 'Total' column by extracting the 'Value' in the row where 'Crime' is 'Total', and then filter out that 'Total' row

arrests2 %>%
    group_by(State) %>% 
    mutate(Total = Value[Crime == 'Total'])  %>%
    filter(Crime != 'Total')
# A tibble: 200 x 4
# Groups:   State [50]
#   State   Crime    Value Total
#   <chr>   <chr>    <dbl> <dbl>
# 1 Alabama Murder    13.2  328.
# 2 Alabama Assault  236    328.
# 3 Alabama UrbanPop  58    328.
# 4 Alabama Rape      21.2  328.
# 5 Alaska  Murder    10    366.
# 6 Alaska  Assault  263    366.
# 7 Alaska  UrbanPop  48    366.
# 8 Alaska  Rape      44.5  366.
# 9 Arizona Murder     8.1  413.
#10 Arizona Assault  294    413.
# … with 190 more rows
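Since the question asks specifically about pivot_wider, the reverse step can also be sketched by pivoting only the 'Total' rows and joining the result back. This is one possible way to do it, not the approach used above; the names long, with_total, totals_wide and restored are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Long table without totals
long <- USArrests %>%
  as_tibble(rownames = "State") %>%
  pivot_longer(-State, names_to = "Crime", values_to = "Value")

# Same shape as arrests2: one extra 'Total' row per State
with_total <- long %>%
  group_by(State) %>%
  summarise(Crime = "Total", Value = sum(Value)) %>%
  bind_rows(long, .) %>%
  arrange(State)

# Pivot only the rows where Crime == "Total"; this yields one row per State
# with a single new column named Total.
totals_wide <- with_total %>%
  filter(Crime == "Total") %>%
  pivot_wider(names_from = Crime, values_from = Value)

# Join the Total column back onto the remaining detail rows
restored <- with_total %>%
  filter(Crime != "Total") %>%
  left_join(totals_wide, by = "State")

restored
```

pivot_wider always acts on every row it is given, so restricting it to a subset comes down to filtering first and joining afterwards.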
akrun answered Apr 07 '23 10:04


1) janitor Use adorn_totals from the janitor package after dropping the Total column. Note that within a group_by pipeline the dot refers to the entire data set, not just the current group, unless it is used inside do, which is why we use do.

library(janitor)

res1 <- arrests %>%
  select(-Total) %>%
  group_by(State) %>%
  do(adorn_totals(select(., -State), "row")) %>%
  ungroup()
res1

giving:

# A tibble: 250 x 3
   State   Crime    Value
   <chr>   <chr>    <dbl>
 1 Alabama Murder    13.2
 2 Alabama Assault  236  
 3 Alabama UrbanPop  58  
 4 Alabama Rape      21.2
 5 Alabama Total    328. 
 6 Alaska  Murder    10  
 7 Alaska  Assault  263  
 8 Alaska  UrbanPop  48  
 9 Alaska  Rape      44.5
10 Alaska  Total    366. 
# ... with 240 more rows

To reverse this, remove the Total rows and join them back as a Total column

res1 %>% {
  left <- filter(., Crime != "Total")
  right <- filter(., Crime == "Total") %>% select(State, Total = Value)
  left_join(left, right, by = "State")
}

2) reshape2 The reshape2 package is a forerunner of the pivot_* functions. It has margins functionality built in, which does not seem to have been carried over to its successors spread/gather and pivot_*. This also works if we replace the library statement with library(data.table).

library(reshape2)

res2 <- dcast(arrests, State + Crime ~ "Value", fun.aggregate = sum, 
  value.var = "Value", margins = "Crime")
res2

giving:

             State    Crime Value
1          Alabama  Assault 236.0
2          Alabama   Murder  13.2
3          Alabama     Rape  21.2
4          Alabama UrbanPop  58.0
5          Alabama    (all) 328.4
6           Alaska  Assault 263.0
7           Alaska   Murder  10.0
8           Alaska     Rape  44.5
9           Alaska UrbanPop  48.0
10          Alaska    (all) 365.5
...etc...

To create a Total column and remove the total rows, create a factor that identifies each row as a Value or Total row and then dcast the result to wide form filling in NAs with na.locf.

library(reshape2)
library(zoo)

fac <- factor(res2$Crime == '(all)', labels = c("Value", "Total"))
dc <- dcast(res2, State + Crime ~ fac, value.var = "Value")
subset(na.locf(dc, fromLast = TRUE), Crime != '(all)')

or

left <- subset(res2, Crime != "(all)")
right <- subset(res2, Crime == "(all)", c(State, Value))
names(right) <- c("State", "Total")
merge(left, right, by = "State")

3) sqldf To use SQL, add a level column that is 0 for detail records and 1 for Total records, then union the details and totals and sort.

library(sqldf)
res3 <- sqldf("select State, Crime, Value from (
  select 0 as level, State, Crime, Value from arrests
  union
  select 1 as level, State, 'Total' as Crime, sum(Value) as Total from arrests
  group by State)
  order by State, level")

To remove the total rows and insert a Total column

sqldf("select State, Crime, Value, Total
  from res3 a
  left join (
     select State, sum(Value) as Total 
       from res3 
       where Crime != 'Total' 
       group by State) using (State)
  where Crime != 'Total'")

4) Base R This is straightforward in base R using xtabs and addmargins.

Total <- sum  # addmargins labels the margin with the name of FUN
tab <- addmargins(xtabs(Value ~ State + Crime, arrests), 2, FUN = Total)
DF <- as.data.frame(tab, responseName = "Value")
res4 <- DF[order(DF$State, DF$Crime == "Total"), ]

Modifying the second approach in (2), we can use the following to remove the Total rows and add a Total column:

left <- subset(res4, Crime != "Total")
right <- subset(res4, Crime == "Total", c(State, Value))
names(right) <- c("State", "Total")
merge(left, right, by = "State")
G. Grothendieck answered Apr 07 '23 09:04