I need to pivot some wide time-series data, that is varying width regard to width using tidyr's pivot_longer()
.
The data is quarterly, but I receive the data in both years-blocks (with with four quarters) and in six months blocks (with only two quarters), i.e. the data is varying with regard to width.
I would like to find a simple and flexible solution that can be used in a loop, as I need to import many years and six months blocks (and, as I need to convince my research group to use R, I am asking here for a simple, smart, and clean solution using (preferably) tidyverse).
The data looks kinda like this in the years-blocks,
dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
and like this in the six months blocks,
dta_wide2 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "", "2", "3", "4", "8", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.1", "0.2", "0.3", "0.8", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "10", "11", "12", "9", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2020", "label1", "", "4", "6", "8", "16", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.22", "0.33", "0.44", "0.55", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "11", "12", "13", "10", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
i.e. (for the six months block)
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta_wide2 %>% as_tibble
# A tibble: 13 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "" "ABC" "" "" "" "" "" "" ""
2 "" "some info" "" "" "" "" "" "" ""
3 "" "Store A" "" "" "" "" "" "" ""
4 "" "" "" "" "" "" "" "" ""
5 "" "As of 31/03/~ "" "" "" "As of ~ "" "" ""
6 "" "label1" "" "label~ "" "label1" "" "labe~ ""
7 "" "" "labe~ "label~ "labe~ "" "lab~ "labe~ "labe~
8 "peach" "2" "0.1" "10" "0.3" "4" "0.2~ "11" "0.4"
9 "dragon~ "3" "0.2" "11" "0.1" "6" "0.3~ "12" "0.1"
10 "honeyd~ "4" "0.3" "12" "0.4" "8" "0.4~ "13" "0.3"
11 "huckle~ "8" "0.8" "9" "0.2" "16" "0.5~ "10" "0.2"
12 "" "(a) some use~ "" "" "" "" "" "" ""
13 "" "(b) more not~ "" "" "" "" "" "" ""
in dta_wide2
the date keys are floting around like this
> dta_wide2[5,] %>% str_sub(start= -10) %>% lubridate::dmy()
[1] NA "2020-03-31" NA NA NA
[6] "2020-06-30" NA NA NA
so I try to tidy it up like this
dta_wide2 %>%
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble
# A tibble: 13 x 12
V1 store date1 V2 V3 V4 date2 V5 V6 V7
<chr> <chr> <date> <chr> <chr> <chr> <date> <chr> <chr> <chr>
1 "" Stor~ 2020-03-31 "ABC" "" "" 2020-06-30 "" "" ""
2 "" Stor~ 2020-03-31 "som~ "" "" 2020-06-30 "" "" ""
3 "" Stor~ 2020-03-31 "Sto~ "" "" 2020-06-30 "" "" ""
4 "" Stor~ 2020-03-31 "" "" "" 2020-06-30 "" "" ""
5 "" Stor~ 2020-03-31 "As ~ "" "" 2020-06-30 "" "As ~ ""
6 "" Stor~ 2020-03-31 "lab~ "" "lab~ 2020-06-30 "" "lab~ ""
7 "" Stor~ 2020-03-31 "" "lab~ "lab~ 2020-06-30 "lab~ "" "lab~
8 "pea~ Stor~ 2020-03-31 "2" "0.1" "10" 2020-06-30 "0.3" "4" "0.2~
9 "dra~ Stor~ 2020-03-31 "3" "0.2" "11" 2020-06-30 "0.1" "6" "0.3~
10 "hon~ Stor~ 2020-03-31 "4" "0.3" "12" 2020-06-30 "0.4" "8" "0.4~
11 "huc~ Stor~ 2020-03-31 "8" "0.8" "9" 2020-06-30 "0.2" "16" "0.5~
12 "" Stor~ 2020-03-31 "(a)~ "" "" 2020-06-30 "" "" ""
13 "" Stor~ 2020-03-31 "(b)~ "" "" 2020-06-30 "" "" ""
# ... with 2 more variables: V8 <chr>, V9 <chr>
Now, I need to pivot it longer using, if I get it corret, pivot_longer
, however my challenge is how -- when I also get data that looks like dta_wide1
, i.e. with four quarters -- do I do it in a flexible way that I can use for both dta_wide1
and dta_wide2
.
I've been working on this for some time and any help to make it work, simplyfy or clean it up will be very much appriciated.
Here's where I'm currently at, but it’s not right, not flexible, and not simply
dta_wide2_foo <- dta_wide2
names(dta_wide2_foo) <- c('goods', paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_1'), paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_2'))
dta_wide2_foo %>%
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble %>% .[8:11,] %>%
pivot_longer(-c(goods, store, date1, date2), values_to = "Value", names_to = "variable") %>% print(n = 100)
Or, some generic snippet, that is not neither simple, smart, or clean, but it could be used to get the positions of the dates in both sample data in a loop
dta <- dta_wide2
dta[5,] %>% str_sub(start= -10) %>% lubridate::dmy() %>% { which(!is.na(.)) }
[1] 2 6
Or, cleaner,
dta <- dta_wide1
dta[5,] %>% grep("As ",.)
[1] 2 6 10 14
My goal is to combine the long data sets to able to plot the data, (Wimpel suggest below that I combine my varying wide data sets, i.e. dta_wide1
, dta_wide2
, ...
dta_widen
, using a lapply() call) I imagine data that looks something like this,
> dta_long
# A tibble: 96 x 5
product label value date store
<chr> <chr> <dbl> <date> <chr>
1 peach label1 7 2019-03-31 Store A
2 peach label2 0.5 2019-03-31 Store A
3 peach label4a 21 2019-03-31 Store A
4 peach label4b 0.3 2019-03-31 Store A
5 peach label1 5 2019-06-30 Store A
6 peach label2 0.46 2019-06-30 Store A
7 peach label4a 19 2019-06-30 Store A
8 peach label4b 0.4 2019-06-30 Store A
9 peach label1 4 2019-09-30 Store A
10 peach label2 0.1 2019-09-30 Store A
# ... with 86 more rows
and then ggplot2/plotting the date with something like this,
dta_long %>% filter(label == 'label1') %>% ggplot(aes(date, value, colour = product)) +
geom_line() + scale_x_date(date_breaks = "3 months",
date_labels = "%b-%y", limits = c((min(dta_long$date)-34), max = max(dta_long$date)))
pivot_wider() is the opposite of pivot_longer() : it makes a dataset wider by increasing the number of columns and decreasing the number of rows. It's relatively rare to need pivot_wider() to make tidy data, but it's often useful for creating summary tables for presentation, or data in a format needed by other tools.
A pivot transformation is one way to transform data from a tall/skinny format to a short/wide format. The data is distributed into columns usually aggregating the values. This means that multiple values from the original data end up in the same place in the new data table.
To add a Pivot Data transformation to data that is already loaded into Spotfire: Select Data > Transform data.... In the Data table drop-down list, select the data table you want to add the transformation to. Select Pivot from the drop-down list and click Add....
I think that the following will work for your issue. I make the following assumptions:
dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))
## Calculate the number of groups of data in the spreadsheet
d1grps <- (ncol(dta_wide1) - 1) / 4 # Divide by 4 due to assumption #1 above
dnew1 <- as_tibble(dta_wide1) %>%
## Take rows that contain data (see assumption #2 above)
slice(8:11) %>%
mutate(
## Get dates from the original data frame and remove empty cells
## (need `unname()` or else this will overwrite variables)
!!!as.data.frame(slice(dta_wide1, 5) %>% select(seq(2, ncol(.), 4)) %>% unname()),
## Get store from second column (see assumption #3 above)
store = dta_wide1[3, 2]
) %>%
## Create variable names for each variable by group
setNames(
c("product", paste0(
c("label1_", "label2_", "label4a_", "label4b_"),
rep(1:d1grps, each = 4)
),
paste0("date_", 1:d1grps), "store"
)) %>%
pivot_longer(
cols = !any_of(c("product", "store")),
names_pattern = "(.+)_(.+)",
names_to = c(".value", "group")
) %>%
mutate(date = lubridate::dmy(sub("As of ", "", date)))
And the final tibble in long format:
# A tibble: 16 x 8
product store group label1 label2 label4a label4b date
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <date>
1 peach Store A 1 7 0.5 21 0.3 2019-03-31
2 peach Store A 2 5 0.46 19 0.4 2019-06-30
3 peach Store A 3 4 0.1 21 0.3 2019-09-30
4 peach Store A 4 8 0.7 24 0.3 2019-12-31
5 dragonfruit Store A 1 5 0.4 21 0.1 2019-03-31
6 dragonfruit Store A 2 2 0.72 22 0.1 2019-06-30
7 dragonfruit Store A 3 1 0.3 23 0.1 2019-09-30
8 dragonfruit Store A 4 6 0.87 25 0.1 2019-12-31
9 honeydew Store A 1 6 0.8 87 0.4 2019-03-31
10 honeydew Store A 2 3 0.7 85 0.3 2019-06-30
11 honeydew Store A 3 4 0.6 71 0.4 2019-09-30
12 honeydew Store A 4 9 0.55 99 0.4 2019-12-31
13 huckleberry Store A 1 1 0.3 21 0.2 2019-03-31
14 huckleberry Store A 2 7 0.8 25 0.2 2019-06-30
15 huckleberry Store A 3 8 0.22 27 0.2 2019-09-30
16 huckleberry Store A 4 9 0.33 35 0.2 2019-12-31
And the plot that you wanted:
ggplot(dnew1, aes(x = date, y = label2, color = product, group = product)) +
geom_line()
The code should be amenable to an expanding number of dates in each spreadsheet, and you could write a function to programmatically find the number of groups rather than relying on changing the variable d1grps
to set the number of groups.
Even though this should work, it might be less of a headache for you to write the contents into a text file and manipulate the text rather than use dplyr
in this way.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With