Converting data from wide to long format when id variables are encoded in column header [duplicate]

Tags: r, tidyr, reshape2

I am relatively new to R and have data in wide format as follows:

subject_id   age    sex  treat1.1.param1    treat1.1.param2   treat1.2.param1   treat1.2.param2
-----------------------------------------------------------------------------------------------
1             23     M         1                  2                  3                   4
2             25     W         5                  6                  7                   8

This is data on several subjects. For a given treatment (here treat1), several parameters (here param1 and param2) are measured over multiple rounds of repeated measurements (here round 1 and round 2). The treatment, round and parameter that each entry belongs to are encoded in the column header, as shown above.

I would like to have the data in long format exemplified as follows:

subject_id  age sex treatment   round       param1      param2
------------------------------------------------------------------------------------------
1           23   M   treat1      1           1          2
1           23   M   treat1      2           3          4
2           25   W   treat1      1           5          6
2           25   W   treat1      2           7          8

That is, the id variables that identify a single observation are subject_id, treatment and round. But since the latter two are encoded in the column headers using dots as separators, I don't know how to move from the wide format to the long format shown above. All my attempts based on standard examples (using reshape2 or tidyr) have failed. Since in reality I have 12 treatments with 30 rounds each and about 50 parameters per round, a largely manual approach would not help me much.

468

asked Jan 24 '20 07:01

Jan

3 Answers

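We can use pivot_longer from tidyr, specifying the names_to and names_pattern arguments. The special ".value" entry in names_to tells pivot_longer that the matching part of each column name (here param1, param2) should become the name of an output column.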

tidyr::pivot_longer(df, 
                    cols = starts_with("treat"), 
                    names_to = c("treatment", "round", ".value"), 
                    names_pattern =  "(\\w+)\\.(\\d+)\\.(\\w+)")

#  subject_id   age sex   treatment  round param1 param2
#       <int> <int> <fct> <chr>      <chr>  <int>  <int>
#1          1    23 M     treat1     1          1      2
#2          1    23 M     treat1     2          3      4
#3          2    25 W     treat1     1          5      6
#4          2    25 W     treat1     2          7      8

Data:

df <- structure(list(subject_id = 1:2, age = c(23L, 25L), sex = structure(1:2, 
.Label = c("M", "W"), class = "factor"), 
treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L), 
treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L)), 
class = "data.frame", row.names = c(NA, -2L))
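Because the three parts of each column name are separated by dots, a variant with names_sep instead of names_pattern may also work. The following is only a sketch under that assumption (every pivoted column name must have exactly three dot-separated parts); it also uses names_transform, available in recent tidyr versions, to store round as an integer rather than a character. The toy data is rebuilt here so the block is self-contained, with sex as a plain character column for simplicity.

```r
library(tidyr)

# Toy data mirroring the question (sex kept as character for simplicity).
df <- data.frame(
  subject_id = 1:2,
  age = c(23L, 25L),
  sex = c("M", "W"),
  treat1.1.param1 = c(1L, 5L),
  treat1.1.param2 = c(2L, 6L),
  treat1.2.param1 = c(3L, 7L),
  treat1.2.param2 = c(4L, 8L)
)

long <- pivot_longer(
  df,
  cols = starts_with("treat"),
  # first part -> treatment, second -> round, third -> output column names
  names_to = c("treatment", "round", ".value"),
  # split the column names on literal dots
  names_sep = "\\.",
  # convert the round labels from character to integer
  names_transform = list(round = as.integer)
)

long
```

If some column names contain additional dots, names_pattern as in the answer above is the safer choice, since it states explicitly which part of the name goes where.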

116

answered Oct 16 '22 10:10

Ronak Shah

You could use tidyr gather, separate and spread:

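library(dplyr)  # attaches %>% and vars()

tibble::tibble(subject_id = 1:2,
               age = c(23,25),
               sex = c("M", "W"),
               round_1_param_1 = c(1,5),
               round_1_param_2 = c(2,6),
               round_2_param_1 = c(3,7),
               round_2_param_2 = c(4,8)) %>% 
  # stack all measurement columns into key/value pairs
  tidyr::gather("key", "value", -subject_id, -age, -sex) %>% 
  # split e.g. "round_1_param_2" into "round_1_" and "_2"
  tidyr::separate(key, c("round", "param"), sep = "param") %>%
  # keep only the numbers; extract_numeric() is deprecated in favour of readr::parse_number()
  dplyr::mutate_at(vars("round", "param"), ~ tidyr::extract_numeric(.)) %>% 
  # one column per parameter
  tidyr::spread(param, value)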

# A tibble: 4 x 6
  subject_id   age sex   round   `1`   `2`
       <int> <dbl> <chr> <dbl> <dbl> <dbl>
1          1    23 M         1     1     2
2          1    23 M         2     3     4
3          2    25 W         1     5     6
4          2    25 W         2     7     8
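The example above uses column names of the form round_1_param_1 rather than the dot-separated names from the question. As a rough sketch of how the same gather, separate and spread steps might be adapted to names like treat1.1.param1, the key can be split on literal dots into three pieces. The intermediate names key, value and param are illustrative choices, not part of the original answer.

```r
library(tidyr)

# Toy data mirroring the question (sex kept as character for simplicity).
df <- data.frame(
  subject_id = 1:2,
  age = c(23L, 25L),
  sex = c("M", "W"),
  treat1.1.param1 = c(1L, 5L),
  treat1.1.param2 = c(2L, 6L),
  treat1.2.param1 = c(3L, 7L),
  treat1.2.param2 = c(4L, 8L)
)

# 1. Stack all measurement columns into key/value pairs.
stacked <- gather(df, "key", "value", -subject_id, -age, -sex)

# 2. Split "treat1.1.param1" on literal dots into three columns;
#    convert = TRUE turns the round labels into integers.
split_up <- separate(stacked, key, c("treatment", "round", "param"),
                     sep = "\\.", convert = TRUE)

# 3. Spread the parameters back out, one column per parameter.
res <- spread(split_up, param, value)

res
```

gather and spread are superseded in current tidyr, so for new code pivot_longer is usually the more direct route.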

answered Oct 16 '22 10:10

Florian

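Here is a possible data.table approach:

library(data.table)

# dd must be a data.table; this assumes df as defined in the pivot_longer answer
dd <- as.data.table(df)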

dcast(melt(dd, id.vars = c("subject_id", "age", 'sex'))
      [, .(subject_id, age, sex, gsub('(\\w+)\\.\\d+\\.\\w+', '\\1', variable),
                                 gsub('\\w+\\.(\\d+)\\.\\w+', '\\1', variable),
                                 gsub('\\w+\\.\\d+\\.(\\w+)', '\\1', variable), value)],
      subject_id + age + sex + V4 + V5 ~ V6)

which gives,

   subject_id age sex     V4 V5 param1 param2
1:          1  23   M treat1  1      1      2
2:          1  23   M treat1  2      3      4
3:          2  25   W treat1  1      5      6
4:          2  25   W treat1  2      7      8
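The three gsub calls above each extract one piece of the column name, which leaves the new columns with the default names V4, V5 and V6. A possible variation, sketched here under the assumption that every measurement column has exactly three dot-separated parts, splits the name once with tstrsplit and assigns readable column names. The toy data is rebuilt so the block runs on its own.

```r
library(data.table)

# Toy data mirroring the question, built directly as a data.table.
dd <- data.table(
  subject_id = 1:2,
  age = c(23L, 25L),
  sex = c("M", "W"),
  treat1.1.param1 = c(1L, 5L),
  treat1.1.param2 = c(2L, 6L),
  treat1.2.param1 = c(3L, 7L),
  treat1.2.param2 = c(4L, 8L)
)

# Long format: one row per subject and original measurement column.
long <- melt(dd, id.vars = c("subject_id", "age", "sex"))

# Split "treat1.1.param1" on literal dots into three named columns.
long[, c("treatment", "round", "param") :=
       tstrsplit(as.character(variable), ".", fixed = TRUE)]

# Back to one column per parameter.
res <- dcast(long, subject_id + age + sex + treatment + round ~ param,
             value.var = "value")

res
```

Here round stays a character column; wrapping it in as.integer before the dcast step would make it numeric if that is needed.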

answered Oct 16 '22 08:10

Sotos
