
Split strings by commas only if substrings are elements of another vector

Tags: string, r, strsplit

I have a set of survey responses where respondents could select zero or more options to answer the question "What types of fruit do you like?". There was also a space for a write-in answer. In the results spreadsheet, each person's response is in one cell with the different types of fruit separated by commas, like so:

(df <- data.frame(id = c("A", "B", "C", "D", "E"), 
                 data = c("oranges, apples, peaches, cherries, pineapples, strawberries",
                          "oranges, peaches, pears", 
                          "pears, nectarines, cherries (bing, rainier)", 
                          "apples, peaches, nectarines", 
                          ""), 
                 stringsAsFactors = FALSE))

#   id                                                         data
# 1  A oranges, apples, peaches, cherries, pineapples, strawberries
# 2  B                                      oranges, peaches, pears
# 3  C                  pears, nectarines, cherries (bing, rainier)
# 4  D                                  apples, peaches, nectarines
# 5  E  

I want to split the responses into a long-format table, which I've nearly accomplished with the code at the bottom. However, some respondents included commas in their write-in responses, and I don't want to split those answers on the commas. I know what all the original multiple choice options were; how can I split out only these options and leave the write-ins (with their commas) intact? I want to end up with a data frame like this:

   id                               data
1   A                            oranges
2   A                             apples
3   A                            peaches
4   A cherries, pineapples, strawberries
5   B                            oranges
6   B                            peaches
7   B                              pears
8   C                              pears
9   C                         nectarines
10  C           cherries (bing, rainier)
11  D                             apples
12  D                            peaches
13  D                         nectarines

The multiple choice options are:

mc_answers <- c("oranges", "plums", "apples", "peaches", "pears", "nectarines")

What I've accomplished so far is:

# use strsplit to create a list of the types of fruit each person likes
datalist <- strsplit(df$data, ", ")
names(datalist) <- df$id

# remove zero-length list elements (person E doesn't like any fruit)
datalist <- Filter(length, datalist)

# convert list elements to data frames
datalist_dfs <- lapply(datalist, data.frame, stringsAsFactors = FALSE)
datalist_dfs <- lapply(datalist_dfs, setNames, "data") # name each column 'data'

# add id column to each data frame
data_long <- mapply(function(x, y) "[<-"(x, "id", value = y), datalist_dfs, 
                    names(datalist_dfs), SIMPLIFY = FALSE)

# combine into one big data frame
(data_per_person <- do.call('rbind', data_long))
#               data id
# A.1        oranges  A
# A.2         apples  A
# A.3        peaches  A
# A.4       cherries  A   # should
# A.5     pineapples  A   # be one
# A.6   strawberries  A   # entry
# B.1        oranges  B
# B.2        peaches  B
# B.3          pears  B
# C.1          pears  C
# C.2     nectarines  C
# C.3 cherries (bing  C   # should be 
# C.4       rainier)  C   # one entry
# D.1         apples  D
# D.2        peaches  D
# D.3     nectarines  D

There are no rules for how many fruits a person could have selected, but if there is a write-in answer it is always last.

Kara Woo asked Aug 02 '14 18:08


1 Answer

After this line:

datalist <- Filter(length, datalist)

re-join every piece that is not one of the multiple choice options into a single write-in string:

datalist <- lapply(datalist, function(x) {
    is_mc <- x %in% mc_answers  # TRUE for known multiple choice options
    if (any(!is_mc))
        # keep the known options, then glue the leftover pieces back together
        c(x[is_mc], paste(x[!is_mc], collapse = ", "))
    else
        x
})
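
To see what this step does, it can be tried on respondent C's answer alone. This is only a minimal sketch: `x`, `is_mc` and `result` are illustrative names that do not appear in the original code.

```r
mc_answers <- c("oranges", "plums", "apples", "peaches", "pears", "nectarines")

# Respondent C's answer after the naive split on ", ";
# the write-in "cherries (bing, rainier)" is broken into two pieces here
x <- strsplit("pears, nectarines, cherries (bing, rainier)", ", ")[[1]]

# TRUE for pieces that are known multiple choice options
is_mc <- x %in% mc_answers

# Keep the known options and re-join the leftover pieces with ", "
result <- c(x[is_mc], paste(x[!is_mc], collapse = ", "))
result
```

The leftover pieces are pasted back with the same separator that was used for splitting, so the write-in is restored as the respondent typed it.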

Then run the rest of your code as-is so you end up with:

> (data_per_person <- do.call('rbind', data_long))
                                  data id
A.1                            oranges  A
A.2                             apples  A
A.3                            peaches  A
A.4 cherries, pineapples, strawberries  A
B.1                            oranges  B
B.2                            peaches  B
B.3                              pears  B
C.1                              pears  C
C.2                         nectarines  C
C.3           cherries (bing, rainier)  C
D.1                             apples  D
D.2                            peaches  D
D.3                         nectarines  D
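
For reference, the whole pipeline could also be written more compactly in base R. The following is a sketch under stated assumptions, not the method from the question or the answer: `split_response`, `pieces` and `long` are hypothetical names, and it relies on the write-in being last (as the question states) and on no comma-separated piece of a write-in being exactly equal to a multiple choice option.

```r
df <- data.frame(id = c("A", "B", "C", "D", "E"),
                 data = c("oranges, apples, peaches, cherries, pineapples, strawberries",
                          "oranges, peaches, pears",
                          "pears, nectarines, cherries (bing, rainier)",
                          "apples, peaches, nectarines",
                          ""),
                 stringsAsFactors = FALSE)
mc_answers <- c("oranges", "plums", "apples", "peaches", "pears", "nectarines")

# Hypothetical helper: split one response, then re-join the non-option pieces
split_response <- function(s, options) {
  parts <- strsplit(s, ", ")[[1]]
  known <- parts %in% options
  if (all(known)) return(parts)  # also covers the empty response
  c(parts[known], paste(parts[!known], collapse = ", "))
}

pieces <- lapply(df$data, split_response, options = mc_answers)

# Repeat each id once per answer; respondents with no answers drop out
long <- data.frame(id = rep(df$id, lengths(pieces)),
                   data = unlist(pieces),
                   stringsAsFactors = FALSE)
long
```

Building the result with `rep()` and `unlist()` avoids creating one data frame per respondent, and it yields the `id`, `data` column order and plain row numbers shown in the question's desired output.
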

Thomas answered Nov 15 '22 05:11