I have data in an unusual format. What should be the variable names / column headings are currently row values, and what should be the row values are the variable names / column headings.
That is, I have a dataframe like this:
id <- seq(1, 5, 1)
good <- c('', 'Q4', 'Q4, Q2', '', '')
ok <- c('Q3, Q1', '', '', 'Q2', '')
bad <- c('', 'Q2', 'Q2', '', '')
data <- as.data.frame(cbind(id, good, ok, bad))
cols <- c('good', 'ok', 'bad')
data[cols] <- lapply(data[cols], as.character)
And I would like to convert it into a dataframe that looks like this:
id <- seq(1,5,1)
Q1 <- c('ok', '', '', '', '')
Q2 <- c('', 'bad', 'good, bad', 'ok', '') # Yes, it is possible to get multiple,
# conflicting responses to a question from one id.
Q3 <- c('ok', '', '', '', '')
Q4 <- c('', 'good', 'good', '', '')
data_new <- as.data.frame(cbind(id, Q1, Q2, Q3, Q4))
cols <- c('Q1', 'Q2', 'Q3', 'Q4')
data_new[cols] <- lapply(data_new[cols], as.character)
Some issues:
What are currently the row values can't just be transposed directlyto the column headings because sometimes multiple entries are recorded in a single cell and there should be only one value recorded in a column heading (e.g., I don't want a columnheading labelled 'Q2, Q4' in the updated dataframe).
On the other hand, when the column headings are moved to become row values, there will sometimes need to be multiple entries in the same cell. For example, the entry under the variable Q2 for id 3 should read 'good, bad' in the updated dataframe.
A dplyr solution is preferred but answers using other packages/base R are also welcomed.
Here is one way with gather/spread. We gather the data from 'wide' to 'long', filter out the rows based on the blank elements in 'val', expand the data by splitting at the delimiter (,) in 'val', grouped by 'id', 'val', paste the elements of 'key' and spread it back to 'wide' format
library(tidyverse)
data %>%
gather(key, val, -id) %>%
filter(val != "") %>%
separate_rows(val) %>%
group_by(id = factor(id, levels = 1:5), val) %>%
summarise(key = toString(key)) %>%
spread(val, key, fill = "", drop = FALSE)
# A tibble: 5 x 5
# Groups: id [5]
# id Q1 Q2 Q3 Q4
# <fct> <chr> <chr> <chr> <chr>
#1 1 ok "" ok ""
#2 2 "" bad "" good
#3 3 "" good, bad "" good
#4 4 "" ok "" ""
#5 5 "" "" "" ""
NOTE: The format is based on the print format of tibble. If we want to change to data.frame, use as.data.frame at the end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With