I am relatively new to R and have data in wide format as follows
subject_id age sex treat1.1.param1 treat1.1.param2 treat1.2.param1 treat1.2.param2
-----------------------------------------------------------------------------------------------
1 23 M 1 2 3 4
2 25 W 5 6 7 8
which is data on several subjects for which we have for a given treatment (here treat1) measures several parameters (here param1 and param2) over multiple rounds of repeated measurements (here round 1 and round 2). The information which treatment, round and parameter the entry for this subject belongs to is coded in the column header as exemplified above.
I would like to have the data in long format exemplified as follows:
subject_id age sex treatment round param1 param2
------------------------------------------------------------------------------------------
1 23 M treat1 1 1 2
1 23 M treat1 2 3 4
2 25 W treat1 1 5 6
2 25 W treat1 2 7 8
That is, the id variable to identify a single observation are subject_id, treatment, round. But since the latter two variables are encoded in the column headers using dots as separators, I don't know how to move from the wide to long format as above. All tries with standard examples (using reshape2
or tidyr
) have failed. Since in reality, I have 12 treatments with each 30 rounds and about 50 parameters per round, a relatively manual way of doing it would not help me too much.
To reshape the dataframe from long to wide in Pandas, we can use Pandas' pd. pivot() method. columns : Column to use to make new frame's columns (e.g., 'Year Month'). values : Column(s) to use for populating new frame's values (e.g., 'Avg.
The easiest way to reshape data between these formats is to use the following two functions from the tidyr package in R: pivot_longer(): Reshapes a data frame from wide to long format. pivot_wider(): Reshapes a data frame from long to wide format.
You want to reshape it to wide format. Press CTRL + SHIFT + ENTER to confirm this formula as it's an array formula. If this formula is entered correctly, you would see the formula inside the curly brackets {}. Column A does not necessarily to be in numeric format.
We can use pivot_longer
from tidyr
specifying the names_to
and names_pattern
argument.
tidyr::pivot_longer(df,
cols = starts_with("treat"),
names_to = c("treatmeant", "round", ".value"),
names_pattern = "(\\w+)\\.(\\d+)\\.(\\w+)")
# subject_id age sex treatmeant round param1 param2
# <int> <int> <fct> <chr> <chr> <int> <int>
#1 1 23 M treat1 1 1 2
#2 1 23 M treat1 2 3 4
#3 2 25 W treat1 1 5 6
#4 2 25 W treat1 2 7 8
data
df <- structure(list(subject_id = 1:2, age = c(23L, 25L), sex = structure(1:2,
.Label = c("M", "W"), class = "factor"),
treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L),
treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L)),
class = "data.frame", row.names = c(NA, -2L))
You could use tidyr gather
, separate
and spread
:
tibble::tibble(subject_id = 1:2,
age = c(23,25),
sex = c("M", "W"),
round_1_param_1 = c(1,5),
round_1_param_2 = c(2,6),
round_2_param_1 = c(3,7),
round_2_param_2 = c(4,8)) %>%
tidyr::gather("key", "value", -subject_id, -age, -sex) %>%
tidyr::separate(key, c("round", "param"), sep = "param") %>%
dplyr::mutate_at(vars("round", "param"), ~ tidyr::extract_numeric(.)) %>%
tidyr::spread(param, value)
# A tibble: 4 x 6
subject_id age sex round `1` `2`
<int> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 23 M 1 1 2
2 1 23 M 2 3 4
3 2 25 W 1 5 6
4 2 25 W 2 7 8
Here is a possible data.table
method,
library(data.table)
dcast(melt(dd, id.vars = c("subject_id", "age", 'sex'))
[, .(subject_id, age, sex, gsub('(\\w+)\\.\\d+\\.\\w+', '\\1', variable),
gsub('\\w+\\.(\\d+)\\.\\w+', '\\1', variable),
gsub('\\w+\\.\\d+\\.(\\w+)', '\\1', variable), value)],
subject_id + age + sex + V4 + V5 ~ V6)
which gives,
subject_id age sex V4 V5 param1 param2 1: 1 23 M treat1 1 1 2 2: 1 23 M treat1 2 3 4 3: 2 25 W treat1 1 5 6 4: 2 25 W treat1 2 7 8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With