Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pivoting data with varying width from wide to long with flexible call (to be used in loop)

I need to pivot some wide time-series data, that is varying width regard to width using tidyr's pivot_longer().

The data is quarterly, but I receive the data in both years-blocks (with with four quarters) and in six months blocks (with only two quarters), i.e. the data is varying with regard to width.

I would like to find a simple and flexible solution that can be used in a loop, as I need to import many years and six months blocks (and, as I need to convince my research group to use R, I am asking here for a simple, smart, and clean solution using (preferably) tidyverse).

The data looks kinda like this in the years-blocks,

dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

and like this in the six months blocks,

dta_wide2 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "", "2", "3", "4", "8", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.1", "0.2", "0.3", "0.8", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "10", "11", "12", "9", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2020", "label1", "", "4", "6", "8", "16", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.22", "0.33", "0.44", "0.55", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "11", "12", "13", "10", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

i.e. (for the six months block)

# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta_wide2 %>% as_tibble
# A tibble: 13 x 9
V1       V2             V3     V4      V5     V6       V7    V8     V9    
<chr>    <chr>          <chr>  <chr>   <chr>  <chr>    <chr> <chr>  <chr> 
   1 ""       "ABC"          ""     ""      ""     ""       ""    ""     ""    
2 ""       "some info"    ""     ""      ""     ""       ""    ""     ""    
3 ""       "Store A"      ""     ""      ""     ""       ""    ""     ""    
4 ""       ""             ""     ""      ""     ""       ""    ""     ""    
5 ""       "As of 31/03/~ ""     ""      ""     "As of ~ ""    ""     ""    
6 ""       "label1"       ""     "label~ ""     "label1" ""    "labe~ ""    
7 ""       ""             "labe~ "label~ "labe~ ""       "lab~ "labe~ "labe~
8 "peach"  "2"            "0.1"  "10"    "0.3"  "4"      "0.2~ "11"   "0.4" 
9 "dragon~ "3"            "0.2"  "11"    "0.1"  "6"      "0.3~ "12"   "0.1" 
10 "honeyd~ "4"            "0.3"  "12"    "0.4"  "8"      "0.4~ "13"   "0.3" 
11 "huckle~ "8"            "0.8"  "9"     "0.2"  "16"     "0.5~ "10"   "0.2" 
12 ""       "(a) some use~ ""     ""      ""     ""       ""    ""     ""    
13 ""       "(b) more not~ ""     ""      ""     ""       ""    ""     ""    

in dta_wide2 the date keys are floting around like this

> dta_wide2[5,] %>% str_sub(start= -10) %>% lubridate::dmy()
[1] NA           "2020-03-31" NA           NA           NA          
[6] "2020-06-30" NA           NA           NA      

so I try to tidy it up like this

dta_wide2 %>% 
   add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2)  %>% 
   add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>% 
   add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble

# A tibble: 13 x 12
   V1    store date1      V2    V3    V4    date2      V5    V6    V7   
   <chr> <chr> <date>     <chr> <chr> <chr> <date>     <chr> <chr> <chr>
 1 ""    Stor~ 2020-03-31 "ABC" ""    ""    2020-06-30 ""    ""    ""   
 2 ""    Stor~ 2020-03-31 "som~ ""    ""    2020-06-30 ""    ""    ""   
 3 ""    Stor~ 2020-03-31 "Sto~ ""    ""    2020-06-30 ""    ""    ""   
 4 ""    Stor~ 2020-03-31 ""    ""    ""    2020-06-30 ""    ""    ""   
 5 ""    Stor~ 2020-03-31 "As ~ ""    ""    2020-06-30 ""    "As ~ ""   
 6 ""    Stor~ 2020-03-31 "lab~ ""    "lab~ 2020-06-30 ""    "lab~ ""   
 7 ""    Stor~ 2020-03-31 ""    "lab~ "lab~ 2020-06-30 "lab~ ""    "lab~
 8 "pea~ Stor~ 2020-03-31 "2"   "0.1" "10"  2020-06-30 "0.3" "4"   "0.2~
 9 "dra~ Stor~ 2020-03-31 "3"   "0.2" "11"  2020-06-30 "0.1" "6"   "0.3~
10 "hon~ Stor~ 2020-03-31 "4"   "0.3" "12"  2020-06-30 "0.4" "8"   "0.4~
11 "huc~ Stor~ 2020-03-31 "8"   "0.8" "9"   2020-06-30 "0.2" "16"  "0.5~
12 ""    Stor~ 2020-03-31 "(a)~ ""    ""    2020-06-30 ""    ""    ""   
13 ""    Stor~ 2020-03-31 "(b)~ ""    ""    2020-06-30 ""    ""    ""   
# ... with 2 more variables: V8 <chr>, V9 <chr>

Now, I need to pivot it longer using, if I get it corret, pivot_longer, however my challenge is how -- when I also get data that looks like dta_wide1, i.e. with four quarters -- do I do it in a flexible way that I can use for both dta_wide1 and dta_wide2.

I've been working on this for some time and any help to make it work, simplyfy or clean it up will be very much appriciated.

Here's where I'm currently at, but it’s not right, not flexible, and not simply

dta_wide2_foo <- dta_wide2
names(dta_wide2_foo) <- c('goods', paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_1'), paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_2'))
dta_wide2_foo %>% 
   add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2)  %>% 
   add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>% 
   add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble %>% .[8:11,]  %>%
   pivot_longer(-c(goods, store, date1, date2), values_to = "Value", names_to = "variable") %>% print(n = 100)

Or, some generic snippet, that is not neither simple, smart, or clean, but it could be used to get the positions of the dates in both sample data in a loop

dta <- dta_wide2
dta[5,] %>% str_sub(start= -10) %>% lubridate::dmy() %>% { which(!is.na(.)) }
[1] 2 6

Or, cleaner,

dta <- dta_wide1
dta[5,] %>% grep("As ",.)
[1]  2  6 10 14

update 2020-06-08 07:45:18Z

My goal is to combine the long data sets to able to plot the data, (Wimpel suggest below that I combine my varying wide data sets, i.e. dta_wide1, dta_wide2, ... dta_widen, using a lapply() call) I imagine data that looks something like this,

> dta_long
# A tibble: 96 x 5
   product label   value date       store  
   <chr>   <chr>   <dbl> <date>     <chr>  
 1 peach   label1   7    2019-03-31 Store A
 2 peach   label2   0.5  2019-03-31 Store A
 3 peach   label4a 21    2019-03-31 Store A
 4 peach   label4b  0.3  2019-03-31 Store A
 5 peach   label1   5    2019-06-30 Store A
 6 peach   label2   0.46 2019-06-30 Store A
 7 peach   label4a 19    2019-06-30 Store A
 8 peach   label4b  0.4  2019-06-30 Store A
 9 peach   label1   4    2019-09-30 Store A
10 peach   label2   0.1  2019-09-30 Store A
# ... with 86 more rows

and then ggplot2/plotting the date with something like this,

dta_long %>% filter(label == 'label1') %>% ggplot(aes(date, value, colour = product)) + 
geom_line() + scale_x_date(date_breaks = "3 months", 
date_labels = "%b-%y", limits = c((min(dta_long$date)-34), max = max(dta_long$date)))

enter image description here

like image 728
Eric Fail Avatar asked Jun 02 '20 14:06

Eric Fail


People also ask

What function do we use to pivot our data from multiple columns to fewer while also increasing the number of rows?

pivot_wider() is the opposite of pivot_longer() : it makes a dataset wider by increasing the number of columns and decreasing the number of rows. It's relatively rare to need pivot_wider() to make tidy data, but it's often useful for creating summary tables for presentation, or data in a format needed by other tools.

What is pivoting of data sets?

A pivot transformation is one way to transform data from a tall/skinny format to a short/wide format. The data is distributed into columns usually aggregating the values. This means that multiple values from the original data end up in the same place in the new data table.

How do you pivot a data in Spotfire?

To add a Pivot Data transformation to data that is already loaded into Spotfire: Select Data > Transform data.... In the Data table drop-down list, select the data table you want to add the transformation to. Select Pivot from the drop-down list and click Add....


1 Answers

I think that the following will work for your issue. I make the following assumptions:

  1. Groups in each sheet are grouped according to date and that each group contains four variables
  2. That the values you want to represent are always in rows 8-11 of the spreadsheet
  3. A sheet represents a single store and the store name will in the third row and second column of the spreadsheet
dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))


## Calculate the number of groups of data in the spreadsheet
d1grps <- (ncol(dta_wide1) - 1) / 4 # Divide by 4 due to assumption #1 above

dnew1 <- as_tibble(dta_wide1) %>%
    ## Take rows that contain data (see assumption #2 above)
    slice(8:11) %>% 
    mutate(
        ## Get dates from the original data frame and remove empty cells
        ## (need `unname()` or else this will overwrite variables)
        !!!as.data.frame(slice(dta_wide1, 5) %>% select(seq(2, ncol(.), 4)) %>% unname()),
        ## Get store from second column (see assumption #3 above)
        store = dta_wide1[3, 2] 
    ) %>%
    ## Create variable names for each variable by group
    setNames(
        c("product", paste0(
            c("label1_", "label2_", "label4a_", "label4b_"), 
            rep(1:d1grps, each = 4)
        ),
        paste0("date_", 1:d1grps), "store" 
    )) %>%
    pivot_longer(
        cols = !any_of(c("product", "store")),
        names_pattern = "(.+)_(.+)",
        names_to = c(".value", "group")
    ) %>%
    mutate(date = lubridate::dmy(sub("As of ", "", date)))

And the final tibble in long format:

# A tibble: 16 x 8
   product     store   group label1 label2 label4a label4b date      
   <chr>       <chr>   <chr> <chr>  <chr>  <chr>   <chr>   <date>    
 1 peach       Store A 1     7      0.5    21      0.3     2019-03-31
 2 peach       Store A 2     5      0.46   19      0.4     2019-06-30
 3 peach       Store A 3     4      0.1    21      0.3     2019-09-30
 4 peach       Store A 4     8      0.7    24      0.3     2019-12-31
 5 dragonfruit Store A 1     5      0.4    21      0.1     2019-03-31
 6 dragonfruit Store A 2     2      0.72   22      0.1     2019-06-30
 7 dragonfruit Store A 3     1      0.3    23      0.1     2019-09-30
 8 dragonfruit Store A 4     6      0.87   25      0.1     2019-12-31
 9 honeydew    Store A 1     6      0.8    87      0.4     2019-03-31
10 honeydew    Store A 2     3      0.7    85      0.3     2019-06-30
11 honeydew    Store A 3     4      0.6    71      0.4     2019-09-30
12 honeydew    Store A 4     9      0.55   99      0.4     2019-12-31
13 huckleberry Store A 1     1      0.3    21      0.2     2019-03-31
14 huckleberry Store A 2     7      0.8    25      0.2     2019-06-30
15 huckleberry Store A 3     8      0.22   27      0.2     2019-09-30
16 huckleberry Store A 4     9      0.33   35      0.2     2019-12-31
 

And the plot that you wanted:

ggplot(dnew1, aes(x = date, y = label2, color = product, group = product)) +
    geom_line()

The code should be amenable to an expanding number of dates in each spreadsheet, and you could write a function to programmatically find the number of groups rather than relying on changing the variable d1grps to set the number of groups.

Even though this should work, it might be less of a headache for you to write the contents into a text file and manipulate the text rather than use dplyr in this way.

like image 126
mikebader Avatar answered Oct 12 '22 23:10

mikebader