Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

parallel pivot_longer of two sets of columns

I have the following data frame:

library(tidyverse)
dat <- tribble(
  ~Scenario,     ~V1,    ~V2,    ~V3,    ~V4,
  1,    0.97,   0.46,   0.79,   0.25,
  1,    0.21,   0.45,   0.23,   0.63,
  1,    0.95,   0.97,   0.07,   0.61,
  1,    0.93,   0.79,   0.23,   0.86,
  2,    0.22,   0.01,   0.42,   0.47,
  2,    0.71,   0.17,   0.16,   0.88,
  3,    0.73,   0.38,   0.10,   0.77,
  3,    0.49,   0.37,   0.90,   0.52,
  3,    0.99,   0.71,   0.66,   0.05,
  3,    0.72,   0.75,   0.69,   0.01,
  3,    0.15,   0.87,   0.12,   0.02,
  4,    0.94,   0.30,   0.91,   0.99)

I'm adding four new columns to this data, where each new column represents the sum of each V1:V4 column grouped by Scenario:

dat_new <- dat %>%
  group_by(Scenario) %>%
  mutate_at(vars(-group_cols()), .funs = list(sum = sum))

And now I want to pivot this data into long format, where one set are my V1:V4 columns and the second set are my V1_sum:V4_sum columns. The normal pivot_longer doesn't work because it only accepts one value column, however, I need two.

I found a potential solution in the tidyverse reference (example at the very bottom), but I'm failing with defining the correct names_pattern. https://tidyr.tidyverse.org/reference/pivot_longer.html

Is there a simple and straightforward way (ideally tidyverse) where I can just do these two pivot_longer sets? It feels that this is one of the basic tasks when reshaping data sets, but I'm unable to get it working.

Expected output:

Scenario    set    V    sum
1           1      0.97 3.06
1           2      0.46 2.67
1           3      0.79 1.32
1           4      0.25 2.35
...
4           4      0.99 0.99

Note: the column names "set", "V", and "sum" are just examples, and I'm fine with other column names if they are easier to generate on the fly.

like image 780
deschen Avatar asked Dec 09 '19 17:12

deschen


3 Answers

I found this solution inside vignette("pivot") under the heading "Multiple Observations Per Row".

You need a parallel naming structure for each set of variables, and you need a separator.

For users looking for a general solution, jump to the "General solution" section below.

Preparing asker's example data

In your case (which is not very general), coming out of the mutate_at function, you get _sum attached to each variable. But you need something after the underscore for the first four variables. I renamed your original four variables with tag _orig and rather inelegantly simplified the names of the sum variables from V1_orig_sum to V1_sum to avoid the double underscores.

dat <- tribble(
  ~Scenario,     ~V1_orig,    ~V2_orig,    ~V3_orig,    ~V4_orig,
  1,    0.97,   0.46,   0.79,   0.25,
  1,    0.21,   0.45,   0.23,   0.63,
  1,    0.95,   0.97,   0.07,   0.61,
  1,    0.93,   0.79,   0.23,   0.86,
  2,    0.22,   0.01,   0.42,   0.47,
  2,    0.71,   0.17,   0.16,   0.88,
  3,    0.73,   0.38,   0.10,   0.77,
  3,    0.49,   0.37,   0.90,   0.52,
  3,    0.99,   0.71,   0.66,   0.05,
  3,    0.72,   0.75,   0.69,   0.01,
  3,    0.15,   0.87,   0.12,   0.02,
  4,    0.94,   0.30,   0.91,   0.99)

dat_new <- dat %>%
  group_by(Scenario) %>%
  mutate_at(vars(-group_cols()), .funs = list(sum = sum)) %>%
  rename_with(.cols=ends_with('_sum'), .fn=str_remove, pattern='_orig')

After these manipulations, your data are in a format appropriate for answering how to pivot_longer() two sets of columns (as shown in the next section).

General solution

We start with a data frame dat_new that has one ID variable (Scenario) and two sets of four columns each:

> head(dat_new)
# A tibble: 6 x 9
# Groups:   Scenario [2]
  Scenario V1_orig V2_orig V3_orig V4_orig V1_sum V2_sum V3_sum V4_sum
     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1        1    0.97    0.46    0.79    0.25   3.06   2.67   1.32   2.35
2        1    0.21    0.45    0.23    0.63   3.06   2.67   1.32   2.35
3        1    0.95    0.97    0.07    0.61   3.06   2.67   1.32   2.35
4        1    0.93    0.79    0.23    0.86   3.06   2.67   1.32   2.35
5        2    0.22    0.01    0.42    0.47   0.93   0.18   0.58   1.35
6        2    0.71    0.17    0.16    0.88   0.93   0.18   0.58   1.35

The column names have the form [row]_[column], where [row] and [column] describe where we want the component to show up in the final data frame.

Call pivot_longer but give it two column names in the names_to argument: One for what you are calling these objects (Var in this case), and ".value" to indicate that there should be as many new columns are there are unique tags after the _ separator:

dat_new %>% 
  pivot_longer(-Scenario, 
               names_to = c("Var", ".value"), 
               names_sep="_" )
# A tibble: 48 x 4
# Groups:   Scenario [4]
   Scenario Var    orig   sum
      <dbl> <chr> <dbl> <dbl>
 1        1 V1     0.97  3.06
 2        1 V2     0.46  2.67
 3        1 V3     0.79  1.32
 4        1 V4     0.25  2.35
 5        1 V1     0.21  3.06
 6        1 V2     0.45  2.67
 7        1 V3     0.23  1.32
 8        1 V4     0.63  2.35
 9        1 V1     0.95  3.06
10        1 V2     0.97  2.67
# ... with 38 more rows

Note if you had variables with the naming system in reverse (i.e. orig_V1, sum_V1, etc..) you can reverse the order in the naming argument: names_to=c(".value", "Var). This solution does not require that the values be the same type (i.e. both orig and sum are numeric here, but one could have been numeric and one a character).

like image 114
emudrak Avatar answered Oct 18 '22 02:10

emudrak


Maybe you should try to pivot before computing the sum:

dat %>% pivot_longer(., -Scenario, names_to = "V", values_to = "Value") %>% 
  group_by(Scenario, V) %>%
  mutate(Sum = sum(Value))


# A tibble: 48 x 4
# Groups:   Scenario, V [16]
   Scenario V     Value   Sum
      <dbl> <chr> <dbl> <dbl>
 1        1 V1     0.97  3.06
 2        1 V2     0.46  2.67
 3        1 V3     0.79  1.32
 4        1 V4     0.25  2.35
 5        1 V1     0.21  3.06
 6        1 V2     0.45  2.67
 7        1 V3     0.23  1.32
 8        1 V4     0.63  2.35
 9        1 V1     0.95  3.06
10        1 V2     0.97  2.67
# … with 38 more rows

Is it what you are looking for ?

like image 3
dc37 Avatar answered Oct 18 '22 01:10

dc37


It was surprisingly difficult. I think perhaps the simplest way is to convert the data to the longest format, then restructure into 2 sets after this. Here's an example with Anscombe's quartet as given in R:

> anscombe
   x1 x2 x3 x4   y1  y2   y3   y4
1  10 10 10  8  8.0 9.1  7.5  6.6
2   8  8  8  8  7.0 8.1  6.8  5.8
3  13 13 13  8  7.6 8.7 12.7  7.7
4   9  9  9  8  8.8 8.8  7.1  8.8
5  11 11 11  8  8.3 9.3  7.8  8.5
6  14 14 14  8 10.0 8.1  8.8  7.0
7   6  6  6  8  7.2 6.1  6.1  5.2
8   4  4  4 19  4.3 3.1  5.4 12.5
9  12 12 12  8 10.8 9.1  8.2  5.6
10  7  7  7  8  4.8 7.3  6.4  7.9
11  5  5  5  8  5.7 4.7  5.7  6.9

So we see there are 2 sets of variables, 4 each. We want a dataset that is just 3 variables: x, y, and series, the latter which just indicates which set of the 4 the values belong to. Here's how I did it:

> anscombe %>% 
+   mutate(row = 1:n()) %>% 
+   pivot_longer(cols = -row) %>% 
+   separate(col = name, into = c("var", "series"), sep = 1) %>% 
+   pivot_wider(id_cols = c(row, series), names_from = "var", values_from = "value") %>% print(n=Inf)
# A tibble: 44 × 4
     row series     x     y
   <int> <chr>  <dbl> <dbl>
 1     1 1         10  8.04
 2     1 2         10  9.14
 3     1 3         10  7.46
 4     1 4          8  6.58
 5     2 1          8  6.95
 6     2 2          8  8.14
 7     2 3          8  6.77
 8     2 4          8  5.76
 9     3 1         13  7.58
10     3 2         13  8.74
11     3 3         13 12.7 
12     3 4          8  7.71
13     4 1          9  8.81
14     4 2          9  8.77
15     4 3          9  7.11
16     4 4          8  8.84
17     5 1         11  8.33
18     5 2         11  9.26
19     5 3         11  7.81
20     5 4          8  8.47
21     6 1         14  9.96
22     6 2         14  8.1 
23     6 3         14  8.84
24     6 4          8  7.04
25     7 1          6  7.24
26     7 2          6  6.13
27     7 3          6  6.08
28     7 4          8  5.25
29     8 1          4  4.26
30     8 2          4  3.1 
31     8 3          4  5.39
32     8 4         19 12.5 
33     9 1         12 10.8 
34     9 2         12  9.13
35     9 3         12  8.15
36     9 4          8  5.56
37    10 1          7  4.82
38    10 2          7  7.26
39    10 3          7  6.42
40    10 4          8  7.91
41    11 1          5  5.68
42    11 2          5  4.74
43    11 3          5  5.73
44    11 4          8  6.89

Step by step:

  1. Add an id column that is just the row number.
  2. Pivot everything to long format except for the id.
  3. Separate the x1...x4 and y1...y4 into two columns.
  4. Restore to wide format with 2 columns for x and y.

We can plot the data to make sure we did it right:

#data from above)
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
facet_wrap("series") +
ggtitle("Anscombe's quartlet: Same correlations but different data")

enter image description here

like image 1
CoderGuy123 Avatar answered Oct 18 '22 02:10

CoderGuy123