I have the following data frame:
library(tidyverse)
dat <- tribble(
~Scenario, ~V1, ~V2, ~V3, ~V4,
1, 0.97, 0.46, 0.79, 0.25,
1, 0.21, 0.45, 0.23, 0.63,
1, 0.95, 0.97, 0.07, 0.61,
1, 0.93, 0.79, 0.23, 0.86,
2, 0.22, 0.01, 0.42, 0.47,
2, 0.71, 0.17, 0.16, 0.88,
3, 0.73, 0.38, 0.10, 0.77,
3, 0.49, 0.37, 0.90, 0.52,
3, 0.99, 0.71, 0.66, 0.05,
3, 0.72, 0.75, 0.69, 0.01,
3, 0.15, 0.87, 0.12, 0.02,
4, 0.94, 0.30, 0.91, 0.99)
I'm adding four new columns to this data, where each new column represents the sum of each V1:V4 column grouped by Scenario
:
dat_new <- dat %>%
group_by(Scenario) %>%
mutate_at(vars(-group_cols()), .funs = list(sum = sum))
And now I want to pivot this data into long format, where one set are my V1:V4 columns and the second set are my V1_sum:V4_sum columns. The normal pivot_longer
doesn't work because it only accepts one value column, however, I need two.
I found a potential solution in the tidyverse reference (example at the very bottom), but I'm failing with defining the correct names_pattern
. https://tidyr.tidyverse.org/reference/pivot_longer.html
Is there a simple and straightforward way (ideally tidyverse) where I can just do these two pivot_longer sets? It feels that this is one of the basic tasks when reshaping data sets, but I'm unable to get it working.
Expected output:
Scenario set V sum
1 1 0.97 3.06
1 2 0.46 2.67
1 3 0.79 1.32
1 4 0.25 2.35
...
4 4 0.99 0.99
Note: the column names "set", "V", and "sum" are just examples, and I'm fine with other column names if they are easier to generate on the fly.
I found this solution inside vignette("pivot")
under the heading "Multiple Observations Per Row".
You need a parallel naming structure for each set of variables, and you need a separator.
For users looking for a general solution, jump to the "General solution" section below.
Preparing asker's example data
In your case (which is not very general), coming out of the mutate_at
function, you get _sum
attached to each variable. But you need something after the underscore for the first four variables. I renamed your original four variables with tag _orig
and rather inelegantly simplified the names of the sum variables from V1_orig_sum
to V1_sum
to avoid the double underscores.
dat <- tribble(
~Scenario, ~V1_orig, ~V2_orig, ~V3_orig, ~V4_orig,
1, 0.97, 0.46, 0.79, 0.25,
1, 0.21, 0.45, 0.23, 0.63,
1, 0.95, 0.97, 0.07, 0.61,
1, 0.93, 0.79, 0.23, 0.86,
2, 0.22, 0.01, 0.42, 0.47,
2, 0.71, 0.17, 0.16, 0.88,
3, 0.73, 0.38, 0.10, 0.77,
3, 0.49, 0.37, 0.90, 0.52,
3, 0.99, 0.71, 0.66, 0.05,
3, 0.72, 0.75, 0.69, 0.01,
3, 0.15, 0.87, 0.12, 0.02,
4, 0.94, 0.30, 0.91, 0.99)
dat_new <- dat %>%
group_by(Scenario) %>%
mutate_at(vars(-group_cols()), .funs = list(sum = sum)) %>%
rename_with(.cols=ends_with('_sum'), .fn=str_remove, pattern='_orig')
After these manipulations, your data are in a format appropriate for answering how to pivot_longer()
two sets of columns (as shown in the next section).
General solution
We start with a data frame dat_new
that has one ID variable (Scenario
) and two sets of four columns each:
> head(dat_new)
# A tibble: 6 x 9
# Groups: Scenario [2]
Scenario V1_orig V2_orig V3_orig V4_orig V1_sum V2_sum V3_sum V4_sum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.97 0.46 0.79 0.25 3.06 2.67 1.32 2.35
2 1 0.21 0.45 0.23 0.63 3.06 2.67 1.32 2.35
3 1 0.95 0.97 0.07 0.61 3.06 2.67 1.32 2.35
4 1 0.93 0.79 0.23 0.86 3.06 2.67 1.32 2.35
5 2 0.22 0.01 0.42 0.47 0.93 0.18 0.58 1.35
6 2 0.71 0.17 0.16 0.88 0.93 0.18 0.58 1.35
The column names have the form [row]_[column]
, where [row]
and [column]
describe where we want the component to show up in the final data frame.
Call pivot_longer
but give it two column names in the names_to
argument: One for what you are calling these objects (Var
in this case), and ".value"
to indicate that there should be as many new columns are there are unique tags after the _
separator:
dat_new %>%
pivot_longer(-Scenario,
names_to = c("Var", ".value"),
names_sep="_" )
# A tibble: 48 x 4
# Groups: Scenario [4]
Scenario Var orig sum
<dbl> <chr> <dbl> <dbl>
1 1 V1 0.97 3.06
2 1 V2 0.46 2.67
3 1 V3 0.79 1.32
4 1 V4 0.25 2.35
5 1 V1 0.21 3.06
6 1 V2 0.45 2.67
7 1 V3 0.23 1.32
8 1 V4 0.63 2.35
9 1 V1 0.95 3.06
10 1 V2 0.97 2.67
# ... with 38 more rows
Note if you had variables with the naming system in reverse (i.e. orig_V1
, sum_V1
, etc..) you can reverse the order in the naming argument: names_to=c(".value", "Var)
. This solution does not require that the values be the same type (i.e. both orig and sum are numeric here, but one could have been numeric and one a character).
Maybe you should try to pivot before computing the sum:
dat %>% pivot_longer(., -Scenario, names_to = "V", values_to = "Value") %>%
group_by(Scenario, V) %>%
mutate(Sum = sum(Value))
# A tibble: 48 x 4
# Groups: Scenario, V [16]
Scenario V Value Sum
<dbl> <chr> <dbl> <dbl>
1 1 V1 0.97 3.06
2 1 V2 0.46 2.67
3 1 V3 0.79 1.32
4 1 V4 0.25 2.35
5 1 V1 0.21 3.06
6 1 V2 0.45 2.67
7 1 V3 0.23 1.32
8 1 V4 0.63 2.35
9 1 V1 0.95 3.06
10 1 V2 0.97 2.67
# … with 38 more rows
Is it what you are looking for ?
It was surprisingly difficult. I think perhaps the simplest way is to convert the data to the longest format, then restructure into 2 sets after this. Here's an example with Anscombe's quartet as given in R:
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.0 9.1 7.5 6.6
2 8 8 8 8 7.0 8.1 6.8 5.8
3 13 13 13 8 7.6 8.7 12.7 7.7
4 9 9 9 8 8.8 8.8 7.1 8.8
5 11 11 11 8 8.3 9.3 7.8 8.5
6 14 14 14 8 10.0 8.1 8.8 7.0
7 6 6 6 8 7.2 6.1 6.1 5.2
8 4 4 4 19 4.3 3.1 5.4 12.5
9 12 12 12 8 10.8 9.1 8.2 5.6
10 7 7 7 8 4.8 7.3 6.4 7.9
11 5 5 5 8 5.7 4.7 5.7 6.9
So we see there are 2 sets of variables, 4 each. We want a dataset that is just 3 variables: x, y, and series, the latter which just indicates which set of the 4 the values belong to. Here's how I did it:
> anscombe %>%
+ mutate(row = 1:n()) %>%
+ pivot_longer(cols = -row) %>%
+ separate(col = name, into = c("var", "series"), sep = 1) %>%
+ pivot_wider(id_cols = c(row, series), names_from = "var", values_from = "value") %>% print(n=Inf)
# A tibble: 44 × 4
row series x y
<int> <chr> <dbl> <dbl>
1 1 1 10 8.04
2 1 2 10 9.14
3 1 3 10 7.46
4 1 4 8 6.58
5 2 1 8 6.95
6 2 2 8 8.14
7 2 3 8 6.77
8 2 4 8 5.76
9 3 1 13 7.58
10 3 2 13 8.74
11 3 3 13 12.7
12 3 4 8 7.71
13 4 1 9 8.81
14 4 2 9 8.77
15 4 3 9 7.11
16 4 4 8 8.84
17 5 1 11 8.33
18 5 2 11 9.26
19 5 3 11 7.81
20 5 4 8 8.47
21 6 1 14 9.96
22 6 2 14 8.1
23 6 3 14 8.84
24 6 4 8 7.04
25 7 1 6 7.24
26 7 2 6 6.13
27 7 3 6 6.08
28 7 4 8 5.25
29 8 1 4 4.26
30 8 2 4 3.1
31 8 3 4 5.39
32 8 4 19 12.5
33 9 1 12 10.8
34 9 2 12 9.13
35 9 3 12 8.15
36 9 4 8 5.56
37 10 1 7 4.82
38 10 2 7 7.26
39 10 3 7 6.42
40 10 4 8 7.91
41 11 1 5 5.68
42 11 2 5 4.74
43 11 3 5 5.73
44 11 4 8 6.89
Step by step:
We can plot the data to make sure we did it right:
#data from above)
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
facet_wrap("series") +
ggtitle("Anscombe's quartlet: Same correlations but different data")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With