Pivoting data with varying width from wide to long with flexible call (to be used in loop)

Tags:

I need to pivot some wide time-series data, that is varying width regard to width using tidyr's pivot_longer().

The data is quarterly, but I receive the data in both years-blocks (with with four quarters) and in six months blocks (with only two quarters), i.e. the data is varying with regard to width.

I would like to find a simple and flexible solution that can be used in a loop, as I need to import many years and six months blocks (and, as I need to convince my research group to use R, I am asking here for a simple, smart, and clean solution using (preferably) tidyverse).

The data looks kinda like this in the years-blocks,

dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

and like this in the six months blocks,

dta_wide2 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "", "2", "3", "4", "8", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.1", "0.2", "0.3", "0.8", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "10", "11", "12", "9", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2020", "label1", "", "4", "6", "8", "16", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.22", "0.33", "0.44", "0.55", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "11", "12", "13", "10", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

i.e. (for the six months block)

# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta_wide2 %>% as_tibble
# A tibble: 13 x 9
V1       V2             V3     V4      V5     V6       V7    V8     V9    
<chr>    <chr>          <chr>  <chr>   <chr>  <chr>    <chr> <chr>  <chr> 
   1 ""       "ABC"          ""     ""      ""     ""       ""    ""     ""    
2 ""       "some info"    ""     ""      ""     ""       ""    ""     ""    
3 ""       "Store A"      ""     ""      ""     ""       ""    ""     ""    
4 ""       ""             ""     ""      ""     ""       ""    ""     ""    
5 ""       "As of 31/03/~ ""     ""      ""     "As of ~ ""    ""     ""    
6 ""       "label1"       ""     "label~ ""     "label1" ""    "labe~ ""    
7 ""       ""             "labe~ "label~ "labe~ ""       "lab~ "labe~ "labe~
8 "peach"  "2"            "0.1"  "10"    "0.3"  "4"      "0.2~ "11"   "0.4" 
9 "dragon~ "3"            "0.2"  "11"    "0.1"  "6"      "0.3~ "12"   "0.1" 
10 "honeyd~ "4"            "0.3"  "12"    "0.4"  "8"      "0.4~ "13"   "0.3" 
11 "huckle~ "8"            "0.8"  "9"     "0.2"  "16"     "0.5~ "10"   "0.2" 
12 ""       "(a) some use~ ""     ""      ""     ""       ""    ""     ""    
13 ""       "(b) more not~ ""     ""      ""     ""       ""    ""     ""

in dta_wide2 the date keys are floting around like this

> dta_wide2[5,] %>% str_sub(start= -10) %>% lubridate::dmy()
[1] NA           "2020-03-31" NA           NA           NA          
[6] "2020-06-30" NA           NA           NA

so I try to tidy it up like this

dta_wide2 %>% 
   add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2)  %>% 
   add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>% 
   add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble

# A tibble: 13 x 12
   V1    store date1      V2    V3    V4    date2      V5    V6    V7   
   <chr> <chr> <date>     <chr> <chr> <chr> <date>     <chr> <chr> <chr>
 1 ""    Stor~ 2020-03-31 "ABC" ""    ""    2020-06-30 ""    ""    ""   
 2 ""    Stor~ 2020-03-31 "som~ ""    ""    2020-06-30 ""    ""    ""   
 3 ""    Stor~ 2020-03-31 "Sto~ ""    ""    2020-06-30 ""    ""    ""   
 4 ""    Stor~ 2020-03-31 ""    ""    ""    2020-06-30 ""    ""    ""   
 5 ""    Stor~ 2020-03-31 "As ~ ""    ""    2020-06-30 ""    "As ~ ""   
 6 ""    Stor~ 2020-03-31 "lab~ ""    "lab~ 2020-06-30 ""    "lab~ ""   
 7 ""    Stor~ 2020-03-31 ""    "lab~ "lab~ 2020-06-30 "lab~ ""    "lab~
 8 "pea~ Stor~ 2020-03-31 "2"   "0.1" "10"  2020-06-30 "0.3" "4"   "0.2~
 9 "dra~ Stor~ 2020-03-31 "3"   "0.2" "11"  2020-06-30 "0.1" "6"   "0.3~
10 "hon~ Stor~ 2020-03-31 "4"   "0.3" "12"  2020-06-30 "0.4" "8"   "0.4~
11 "huc~ Stor~ 2020-03-31 "8"   "0.8" "9"   2020-06-30 "0.2" "16"  "0.5~
12 ""    Stor~ 2020-03-31 "(a)~ ""    ""    2020-06-30 ""    ""    ""   
13 ""    Stor~ 2020-03-31 "(b)~ ""    ""    2020-06-30 ""    ""    ""   
# ... with 2 more variables: V8 <chr>, V9 <chr>

Now, I need to pivot it longer using, if I get it corret, pivot_longer, however my challenge is how -- when I also get data that looks like dta_wide1, i.e. with four quarters -- do I do it in a flexible way that I can use for both dta_wide1 and dta_wide2.

I've been working on this for some time and any help to make it work, simplyfy or clean it up will be very much appriciated.

Here's where I'm currently at, but it’s not right, not flexible, and not simply

dta_wide2_foo <- dta_wide2
names(dta_wide2_foo) <- c('goods', paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_1'), paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_2'))
dta_wide2_foo %>% 
   add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2)  %>% 
   add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>% 
   add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble %>% .[8:11,]  %>%
   pivot_longer(-c(goods, store, date1, date2), values_to = "Value", names_to = "variable") %>% print(n = 100)

Or, some generic snippet, that is not neither simple, smart, or clean, but it could be used to get the positions of the dates in both sample data in a loop

dta <- dta_wide2
dta[5,] %>% str_sub(start= -10) %>% lubridate::dmy() %>% { which(!is.na(.)) }
[1] 2 6

Or, cleaner,

dta <- dta_wide1
dta[5,] %>% grep("As ",.)
[1]  2  6 10 14

update 2020-06-08 07:45:18Z

My goal is to combine the long data sets to able to plot the data, (Wimpel suggest below that I combine my varying wide data sets, i.e. dta_wide1, dta_wide2, ... dta_widen, using a lapply() call) I imagine data that looks something like this,

> dta_long
# A tibble: 96 x 5
   product label   value date       store  
   <chr>   <chr>   <dbl> <date>     <chr>  
 1 peach   label1   7    2019-03-31 Store A
 2 peach   label2   0.5  2019-03-31 Store A
 3 peach   label4a 21    2019-03-31 Store A
 4 peach   label4b  0.3  2019-03-31 Store A
 5 peach   label1   5    2019-06-30 Store A
 6 peach   label2   0.46 2019-06-30 Store A
 7 peach   label4a 19    2019-06-30 Store A
 8 peach   label4b  0.4  2019-06-30 Store A
 9 peach   label1   4    2019-09-30 Store A
10 peach   label2   0.1  2019-09-30 Store A
# ... with 86 more rows

and then ggplot2/plotting the date with something like this,

dta_long %>% filter(label == 'label1') %>% ggplot(aes(date, value, colour = product)) + 
geom_line() + scale_x_date(date_breaks = "3 months", 
date_labels = "%b-%y", limits = c((min(dta_long$date)-34), max = max(dta_long$date)))

enter image description here

728

asked Jun 02 '20 14:06

Eric Fail

1 Answers

I think that the following will work for your issue. I make the following assumptions:

Groups in each sheet are grouped according to date and that each group contains four variables
That the values you want to represent are always in rows 8-11 of the spreadsheet
A sheet represents a single store and the store name will in the third row and second column of the spreadsheet

dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))


## Calculate the number of groups of data in the spreadsheet
d1grps <- (ncol(dta_wide1) - 1) / 4 # Divide by 4 due to assumption #1 above

dnew1 <- as_tibble(dta_wide1) %>%
    ## Take rows that contain data (see assumption #2 above)
    slice(8:11) %>% 
    mutate(
        ## Get dates from the original data frame and remove empty cells
        ## (need `unname()` or else this will overwrite variables)
        !!!as.data.frame(slice(dta_wide1, 5) %>% select(seq(2, ncol(.), 4)) %>% unname()),
        ## Get store from second column (see assumption #3 above)
        store = dta_wide1[3, 2] 
    ) %>%
    ## Create variable names for each variable by group
    setNames(
        c("product", paste0(
            c("label1_", "label2_", "label4a_", "label4b_"), 
            rep(1:d1grps, each = 4)
        ),
        paste0("date_", 1:d1grps), "store" 
    )) %>%
    pivot_longer(
        cols = !any_of(c("product", "store")),
        names_pattern = "(.+)_(.+)",
        names_to = c(".value", "group")
    ) %>%
    mutate(date = lubridate::dmy(sub("As of ", "", date)))

And the final tibble in long format:

# A tibble: 16 x 8
   product     store   group label1 label2 label4a label4b date      
   <chr>       <chr>   <chr> <chr>  <chr>  <chr>   <chr>   <date>    
 1 peach       Store A 1     7      0.5    21      0.3     2019-03-31
 2 peach       Store A 2     5      0.46   19      0.4     2019-06-30
 3 peach       Store A 3     4      0.1    21      0.3     2019-09-30
 4 peach       Store A 4     8      0.7    24      0.3     2019-12-31
 5 dragonfruit Store A 1     5      0.4    21      0.1     2019-03-31
 6 dragonfruit Store A 2     2      0.72   22      0.1     2019-06-30
 7 dragonfruit Store A 3     1      0.3    23      0.1     2019-09-30
 8 dragonfruit Store A 4     6      0.87   25      0.1     2019-12-31
 9 honeydew    Store A 1     6      0.8    87      0.4     2019-03-31
10 honeydew    Store A 2     3      0.7    85      0.3     2019-06-30
11 honeydew    Store A 3     4      0.6    71      0.4     2019-09-30
12 honeydew    Store A 4     9      0.55   99      0.4     2019-12-31
13 huckleberry Store A 1     1      0.3    21      0.2     2019-03-31
14 huckleberry Store A 2     7      0.8    25      0.2     2019-06-30
15 huckleberry Store A 3     8      0.22   27      0.2     2019-09-30
16 huckleberry Store A 4     9      0.33   35      0.2     2019-12-31

And the plot that you wanted:

ggplot(dnew1, aes(x = date, y = label2, color = product, group = product)) +
    geom_line()

The code should be amenable to an expanding number of dates in each spreadsheet, and you could write a function to programmatically find the number of groups rather than relying on changing the variable d1grps to set the number of groups.

Even though this should work, it might be less of a headache for you to write the contents into a text file and manipulate the text rather than use dplyr in this way.

126

answered Oct 12 '22 23:10

mikebader

Related questions
                            
                                Using a token from `googleAuthR` in `googlesheets`
                            
                                How to use your own image for geom_point in gganimate?
                            
                                Why does as.character drop decimal point?
                            
                                How to run "conda ***" in a system command in R
                            
                                Changing behavior for closure stored in data.table between R 3.4.3 and R 3.6.0
                            
                                Is there a way in R to detect how often a package is used?
                            
                                How to add outer track for circlize plot
                            
                                Fitting a local level Poisson (State Space Model)
                            
                                Unable to install various R packages probably after upgrading to Mac OS Catalina
                            
                                RSelenium concurrent users for multiple scenarios on Shiny
                            
                                Error: package or namespace load failed for ‘data.table’ in library.dynam(lib, package, package.lib): shared object ‘datatable.so’ not found
                            
                                Tidying financial data with mixed decimal and grouping digits
                            
                                How to read data from google drive using R in colab?
                            
                                Is there any explicit guarantee that dplyr operations preserve row order?
                            
                                Comparison of two vectors resulted after simulation
                            
                                Function of function always returns 0-R
                            
                                Why do I get a segfault when calling my C++ function with .Call rather than .C?
                            
                                DiagrammeR - arrow problems
                            
                                R Flatten nested lists of different lengths (Google geocode API output) in R
                            
                                Can't plot ggplot2 objects created with R 3.x into R 4.x imported from a RDS file

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pivoting data with varying width from wide to long with flexible call (to be used in loop)

Tags:

r

time-series

tidyr

tidyverse

update 2020-06-08 07:45:18Z

Eric Fail

People also ask

1 Answers

mikebader

Recent Activity

Donate For Us