Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Transform dataframe to long format based on text

Tags:

r

tidyr

I have a dataset from a survey. The setup of the survey was as follows: Q1 was a multiple choice question, and in Q2 answers are given for each of the selected choices in Q1 (same for Q3 up to Q9). After exporting and uploading to R, the dataframe is as follows:

df <- data.frame(
  ID = 1:3,
  Q1 = c("Purple, Red, Blue", "Red, Blue, Yellow", "Red"),
  "Q2: Is it your favorite color? - Purple" = c("Average", NA, NA),
  "Q2: Is it your favorite color? - Red" = c("No", "No", "Average"),
  "Q2: Is it your favorite color? - Blue" = c("Yes", "No", NA),
  "Q2: Is it your favorite color? - Yellow" = c(NA, "Average", NA)
)

What I'm looking to do is to transform so that the dataframe is a long format like below

ID Q1 Q2
1 Purple Average
1 Red No
1 Blue Yes
2 Red No
2 Blue No
2 Yellow Average
3 Red Average

I was thinking to first separate the answers to Q1 into rows, and then pivot_longer the Q2 columns based on the selected option in Q2.

df1 <- df %>% 
  separate_rows(Q1, sep = ",")

df2 <- df1 %>% 
  pivot_longer(cols = matches(Q1),
               names_to = Q2,
               values_to = Q2_answer)

But this doesn't work (error in pivot_longer: selections can't have missing values).

What am I doing wrong? Or is there another way to do this?

like image 502
Ekd Avatar asked Dec 22 '25 06:12

Ekd


2 Answers

Probably this could help

df %>%
  select(-Q1) %>%
  pivot_longer(-ID, names_to = "Q1", values_to = "Q2") %>%
  na.omit() %>%
  mutate(Q1 = str_extract(Q1, "\\w+$"))

or with values_drop_na = TRUE (as suggested by M-- in the comment)

df %>%
  select(-Q1) %>%
  pivot_longer(-ID, names_to = "Q1", values_to = "Q2", values_drop_na = TRUE) %>%
  mutate(Q1 = str_extract(Q1, "\\w+$"))

which gives

# A tibble: 7 × 3
     ID Q1     Q2     
  <int> <chr>  <chr>
1     1 Purple Average
2     1 Red    No
3     1 Blue   Yes
4     2 Red    No
5     2 Blue   No
6     2 Yellow Average
7     3 Red    Average
like image 154
ThomasIsCoding Avatar answered Dec 23 '25 22:12

ThomasIsCoding


Try this:

library(dplyr)
library(tidyr)
pivot_longer(df, cols = -c(ID, Q1), values_to = "Q2") |>
  separate_longer_delim(Q1, delim = ",") |>
  mutate(Q1 = trimws(Q1)) |>
  filter(sub(".* ", "", name) == Q1) |>
  select(-name)
# # A tibble: 7 × 3
#      ID Q1     Q2     
#   <int> <chr>  <chr>  
# 1     1 Purple Average
# 2     1 Red    No     
# 3     1 Blue   Yes    
# 4     2 Red    No     
# 5     2 Blue   No     
# 6     2 Yellow Average
# 7     3 Red    Average

Data, adding check.names=:

df <- data.frame(
  ID = 1:3,
  Q1 = c("Purple, Red, Blue", "Red, Blue, Yellow", "Red"),
  "Q2: Is it your favorite color? - Purple" = c("Average", NA, NA),
  "Q2: Is it your favorite color? - Red" = c("No", "No", "Average"),
  "Q2: Is it your favorite color? - Blue" = c("Yes", "No", NA),
  "Q2: Is it your favorite color? - Yellow" = c(NA, "Average", NA),
  check.names = FALSE)
like image 45
r2evans Avatar answered Dec 23 '25 21:12

r2evans