I have a set of survey responses where respondents could select zero or more options to answer the question "What types of fruit do you like?". There was also a space for a write-in answer. In the results spreadsheet, each person's response is in one cell with the different types of fruit separated by commas, like so:
(df <- data.frame(id = c("A", "B", "C", "D", "E"),
                  data = c("oranges, apples, peaches, cherries, pineapples, strawberries",
                           "oranges, peaches, pears",
                           "pears, nectarines, cherries (bing, rainier)",
                           "apples, peaches, nectarines",
                           ""),
                  stringsAsFactors = FALSE))
# id data
# 1 A oranges, apples, peaches, cherries, pineapples, strawberries
# 2 B oranges, peaches, pears
# 3 C pears, nectarines, cherries (bing, rainier)
# 4 D apples, peaches, nectarines
# 5 E
What I want to do is split up the responses into a long-format table, which I've nearly accomplished using the code at the bottom. However, some respondents included commas in their write-in responses, and I don't want to split their answers on the commas. I know what all the original multiple choice options were; how can I split up only these answers, leaving the write-ins (with commas) intact? I want to end up with a data frame like this:
id data
1 A oranges
2 A apples
3 A peaches
4 A cherries, pineapples, strawberries
5 B oranges
6 B peaches
7 B pears
8 C pears
9 C nectarines
10 C cherries (bing, rainier)
11 D apples
12 D peaches
13 D nectarines
The multiple choice options are:
mc_answers <- c("oranges", "plums", "apples", "peaches", "pears", "nectarines")
What I've accomplished so far is:
# use strsplit to create a list of the types of fruit each person likes
datalist <- strsplit(df$data, ", ")
names(datalist) <- df$id
# remove zero-length list elements (person E doesn't like any fruit)
datalist <- Filter(length, datalist)
# convert list elements to data frames
datalist_dfs <- lapply(datalist, data.frame, stringsAsFactors = FALSE)
datalist_dfs <- lapply(datalist_dfs, setNames, "data") # name each column 'data'
# add id column to each data frame
data_long <- mapply(function(x, y) "[<-"(x, "id", value = y),
                    datalist_dfs, names(datalist_dfs), SIMPLIFY = FALSE)
# combine into one big data frame
(data_per_person <- do.call('rbind', data_long))
# data id
# A.1 oranges A
# A.2 apples A
# A.3 peaches A
# A.4 cherries A # should
# A.5 pineapples A # be one
# A.6 strawberries A # entry
# B.1 oranges B
# B.2 peaches B
# B.3 pears B
# C.1 pears C
# C.2 nectarines C
# C.3 cherries (bing C # should be
# C.4 rainier) C # one entry
# D.1 apples D
# D.2 peaches D
# D.3 nectarines D
There are no rules for how many fruits a person could have selected, but if there is a write-in answer it is always last.
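After this line:
datalist <- Filter(length, datalist)
add a step that keeps the multiple choice answers as separate elements and pastes everything else back together into a single write-in:
datalist <- lapply(datalist, function(x) {
  if (any(!x %in% mc_answers))
    # keep the known options, then re-join the leftover pieces into one write-in
    c(x[x %in% mc_answers], paste(x[!x %in% mc_answers], collapse = ", "))
  else
    # no write-in: every element is already a known option
    x[x %in% mc_answers]
})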
Then run the rest of your code as-is so you end up with:
> (data_per_person <- do.call('rbind', data_long))
data id
A.1 oranges A
A.2 apples A
A.3 peaches A
A.4 cherries, pineapples, strawberries A
B.1 oranges B
B.2 peaches B
B.3 pears B
C.1 pears C
C.2 nectarines C
C.3 cherries (bing, rainier) C
D.1 apples D
D.2 peaches D
D.3 nectarines D
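For reference, the same idea can be sketched as one self-contained script that goes straight from df to the long table. This is only a sketch under the question's stated assumptions: pieces are separated by ", ", and anything that is not in mc_answers belongs to a single write-in that comes last. The helper name split_keep_writein is made up for illustration and is not part of the original code.

```r
# Sample data and multiple choice options copied from the question
df <- data.frame(id = c("A", "B", "C", "D", "E"),
                 data = c("oranges, apples, peaches, cherries, pineapples, strawberries",
                          "oranges, peaches, pears",
                          "pears, nectarines, cherries (bing, rainier)",
                          "apples, peaches, nectarines",
                          ""),
                 stringsAsFactors = FALSE)
mc_answers <- c("oranges", "plums", "apples", "peaches", "pears", "nectarines")

# Hypothetical helper: split one response on ", ", keep known options as
# separate elements, and re-join every other piece into a single write-in
split_keep_writein <- function(x, options) {
  parts <- strsplit(x, ", ", fixed = TRUE)[[1]]
  is_mc <- parts %in% options
  writein <- parts[!is_mc]
  c(parts[is_mc], if (length(writein) > 0) paste(writein, collapse = ", "))
}

answers <- lapply(df$data, split_keep_writein, options = mc_answers)

# Repeat each id once per answer; respondents with no answers (E) drop out
data_long <- data.frame(id = rep(df$id, lengths(answers)),
                        data = unlist(answers),
                        stringsAsFactors = FALSE)
print(data_long)
```

Note that, like the lapply() step above, this would pull a multiple choice option out of a write-in if a respondent typed one there as its own comma-separated piece.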