Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge different number of columns in R

Tags:

r

I am attempting to merge two or more columns in an automated way in some survey data. Example data:

data <- data.frame("Q1: What is your gender?" = c("Male","Male",NA,NA,"Male"),
                   "Q1: What is your gender?" = c(NA,NA,"Female","Female",NA),
                   "Q2: Where do you live?" = c("North","North",NA,NA,NA),
                   "Q2: Where do you live?" = c(NA,NA,NA,NA,"South"),
                   "Q2: Where do you live?" = c(NA,NA,NA,"West",NA),
                   "Q2: Where do you live?" = c(NA,NA,"East",NA,NA))
data[] <- lapply(data, as.character)

And this is what I want to achieve:

data.wanted <- data.frame("Q1: What is your gender?" = c("Male","Male","Female","Female","Male"),
                          "Q2: Where do you live?" = c("North","North","East","West","East"))
data.wanted[] <- lapply(data.wanted, as.character)

Each respondent only has one non-NA response per question. I had a look at (amongst others) Merging two columns into one in R, but can't figure out how to use coalesce across many questions which may represent a varying number of columns. I could do for each question:

data["Q1"] <- coalesce(data[,1],data[,2])
data["Q2"] <- coalesce(data[,3],data[,4],data[,5],data[,6])

i.e. the manual way. However, since I have many questions, each following the above structure, I am really looking for an automated way to do this, through either looping, where I refer to the column names through grep or some alternative method.

Any suggestions are much appreciated!

like image 299
thesixmax Avatar asked Jan 01 '23 08:01

thesixmax


2 Answers

Here is one way to do this in base R :

as.data.frame(sapply(split.default(data, sub("\\.\\d+$", "", names(data))), 
              function(x) do.call(pmax, c(x, na.rm = TRUE))))

#  Q1..What.is.your.gender. Q2..Where.do.you.live.
#1                     Male                  North
#2                     Male                  North
#3                   Female                   East
#4                   Female                   West
#5                     Male                  South

Explanation:

It is easier to understand how this works if we break it down step-by-step :

names returns the column names of the data

names(data)
#[1] "Q1..What.is.your.gender." "Q1..What.is.your.gender..1" "Q2..Where.do.you.live."     
#    "Q2..Where.do.you.live..1" "Q2..Where.do.you.live..2"   "Q2..Where.do.you.live..3"  

Using sub, we remove the additional . and a number from each name so we get common column names like this

sub("\\.\\d+$", "", names(data))
#[1] "Q1..What.is.your.gender." "Q1..What.is.your.gender." "Q2..Where.do.you.live."
#    "Q2..Where.do.you.live." "Q2..Where.do.you.live." "Q2..Where.do.you.live."

We use this in split.default to split data based on similar column names

split.default(data, sub("\\.\\d+$", "", names(data)))
#$Q1..What.is.your.gender.
#  Q1..What.is.your.gender. Q1..What.is.your.gender..1
#1                     Male                       <NA>
#2                     Male                       <NA>
#3                     <NA>                     Female
#4                     <NA>                     Female
#5                     Male                       <NA>

#$Q2..Where.do.you.live.
#  Q2..Where.do.you.live. Q2..Where.do.you.live..1 Q2..Where.do.you.live..2 Q2..Where.do.you.live..3
#1                  North                     <NA>                     <NA>                     <NA>
#2                  North                     <NA>                     <NA>                     <NA>
#3                   <NA>                     <NA>                     <NA>                     East
#4                   <NA>                     <NA>                     West                     <NA>
#5                   <NA>                    South                     <NA>                     <NA>

We use sapply to iterate over each list and select the max value from each row removing NA values (Actually, in this case, using pmin would also work) which returns a matrix and we convert it into a dataframe.

as.data.frame(sapply(split.default(data, sub("\\.\\d+$", "", names(data))), 
        function(x) do.call(pmax, c(x, na.rm = TRUE))))

#  Q1..What.is.your.gender. Q2..Where.do.you.live.
#1                     Male                  North
#2                     Male                  North
#3                   Female                   East
#4                   Female                   West
#5                     Male                  South
like image 180
Ronak Shah Avatar answered Jan 02 '23 22:01

Ronak Shah


You can use some dplyr and tidyr functions to reshape the data to cope with not knowing how many columns each question will need to cover. Assign a row number to keep the data organized, then convert to long format.

I assume you noticed that the data frame's names get converted to syntactically-correct ones (no duplicate names, no spaces, etc). So after reshaping, clean up the text that came from column names so that both "Q1..What.is.your.gender." and "Q1..What.is.your.gender..1" become "Q1..What.is.your.gender". Then filter to keep just the rows with actual values, and reshape back to wide. Drop the row number column if you no longer need it.

library(dplyr)

data %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_longer(-row) %>%
  mutate(name = sub("\\.+\\d?$", "", name)) %>%
  filter(!is.na(value)) %>%
  tidyr::pivot_wider()
#> # A tibble: 5 x 3
#>     row Q1..What.is.your.gender Q2..Where.do.you.live
#>   <int> <fct>                   <fct>                
#> 1     1 Male                    North                
#> 2     2 Male                    North                
#> 3     3 Female                  East                 
#> 4     4 Female                  West                 
#> 5     5 Male                    South
like image 44
camille Avatar answered Jan 02 '23 23:01

camille