I am attempting to merge two or more columns in an automated way in some survey data. Example data:
data <- data.frame("Q1: What is your gender?" = c("Male","Male",NA,NA,"Male"),
"Q1: What is your gender?" = c(NA,NA,"Female","Female",NA),
"Q2: Where do you live?" = c("North","North",NA,NA,NA),
"Q2: Where do you live?" = c(NA,NA,NA,NA,"South"),
"Q2: Where do you live?" = c(NA,NA,NA,"West",NA),
"Q2: Where do you live?" = c(NA,NA,"East",NA,NA))
data[] <- lapply(data, as.character)
And this is what I want to achieve:
data.wanted <- data.frame("Q1: What is your gender?" = c("Male","Male","Female","Female","Male"),
"Q2: Where do you live?" = c("North","North","East","West","East"))
data.wanted[] <- lapply(data.wanted, as.character)
Each respondent only has one non-NA
response per question. I had a look at (amongst others) Merging two columns into one in R, but can't figure out how to use coalesce
across many questions which may represent a varying number of columns. I could do for each question:
data["Q1"] <- coalesce(data[,1],data[,2])
data["Q2"] <- coalesce(data[,3],data[,4],data[,5],data[,6])
i.e. the manual way. However, since I have many questions, each following the above structure, I am really looking for an automated way to do this, through either looping, where I refer to the column names through grep
or some alternative method.
Any suggestions are much appreciated!
Here is one way to do this in base R :
as.data.frame(sapply(split.default(data, sub("\\.\\d+$", "", names(data))),
function(x) do.call(pmax, c(x, na.rm = TRUE))))
# Q1..What.is.your.gender. Q2..Where.do.you.live.
#1 Male North
#2 Male North
#3 Female East
#4 Female West
#5 Male South
It is easier to understand how this works if we break it down step-by-step :
names
returns the column names of the data
names(data)
#[1] "Q1..What.is.your.gender." "Q1..What.is.your.gender..1" "Q2..Where.do.you.live."
# "Q2..Where.do.you.live..1" "Q2..Where.do.you.live..2" "Q2..Where.do.you.live..3"
Using sub
, we remove the additional .
and a number from each name so we get common column names like this
sub("\\.\\d+$", "", names(data))
#[1] "Q1..What.is.your.gender." "Q1..What.is.your.gender." "Q2..Where.do.you.live."
# "Q2..Where.do.you.live." "Q2..Where.do.you.live." "Q2..Where.do.you.live."
We use this in split.default
to split data based on similar column names
split.default(data, sub("\\.\\d+$", "", names(data)))
#$Q1..What.is.your.gender.
# Q1..What.is.your.gender. Q1..What.is.your.gender..1
#1 Male <NA>
#2 Male <NA>
#3 <NA> Female
#4 <NA> Female
#5 Male <NA>
#$Q2..Where.do.you.live.
# Q2..Where.do.you.live. Q2..Where.do.you.live..1 Q2..Where.do.you.live..2 Q2..Where.do.you.live..3
#1 North <NA> <NA> <NA>
#2 North <NA> <NA> <NA>
#3 <NA> <NA> <NA> East
#4 <NA> <NA> West <NA>
#5 <NA> South <NA> <NA>
We use sapply
to iterate over each list and select the max value from each row removing NA
values (Actually, in this case, using pmin
would also work) which returns a matrix and we convert it into a dataframe.
as.data.frame(sapply(split.default(data, sub("\\.\\d+$", "", names(data))),
function(x) do.call(pmax, c(x, na.rm = TRUE))))
# Q1..What.is.your.gender. Q2..Where.do.you.live.
#1 Male North
#2 Male North
#3 Female East
#4 Female West
#5 Male South
You can use some dplyr
and tidyr
functions to reshape the data to cope with not knowing how many columns each question will need to cover. Assign a row number to keep the data organized, then convert to long format.
I assume you noticed that the data frame's names get converted to syntactically-correct ones (no duplicate names, no spaces, etc). So after reshaping, clean up the text that came from column names so that both "Q1..What.is.your.gender."
and "Q1..What.is.your.gender..1"
become "Q1..What.is.your.gender"
. Then filter to keep just the rows with actual values, and reshape back to wide. Drop the row number column if you no longer need it.
library(dplyr)
data %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(-row) %>%
mutate(name = sub("\\.+\\d?$", "", name)) %>%
filter(!is.na(value)) %>%
tidyr::pivot_wider()
#> # A tibble: 5 x 3
#> row Q1..What.is.your.gender Q2..Where.do.you.live
#> <int> <fct> <fct>
#> 1 1 Male North
#> 2 2 Male North
#> 3 3 Female East
#> 4 4 Female West
#> 5 5 Male South
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With