Split a column of concatenated comma-delimited data and recode output as factors

Question

I am trying to clean up some data that has been incorrectly entered. The question for the variable allows for multiple responses out of five choices, numbered as 1 to 5. The data has been entered in the following manner (this is just an example--there are many more variables and many more observations in the actual data frame):

data
          V1
1    1, 2, 3
2    1, 2, 4
3 2, 3, 4, 5
4    1, 3, 4
5    1, 3, 5
6 2, 3, 4, 5

Here's some code to recreate that example data:

data = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5", 
                         "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"))

What I actually need is the data to be treated more... binary--like a set of "yes/no" questions--entered in a data frame that looks more like:

data
    V1.1  V1.2  V1.3  V1.4  V1.5
1      1     1     1    NA    NA
2      1     1    NA     1    NA
3     NA     1     1     1     1
4      1    NA     1     1    NA
5      1    NA     1    NA     1
6     NA     1     1     1     1

The actual variable names don't matter at the moment--I can easily fix that. Also, it doesn't matter too much whether the missing elements are "0", "NA", or blank--again, that's something I can fix later.

I've tried using the transform function from the reshape package as well as a few different things with strsplit, but I can't get either to do what I am looking for. I've also looked at many other related questions on Stack Overflow, but they don't seem to be quite the same problem.

Eric Boorman · Accepted Answer

This is my first time answering a question on Stack Overflow. Please let me know if this makes sense.

I had this problem when I was working with some Qualtrics data. I used grepl to solve the issue. I've included a link to the R documentation.
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/grep

As I understand it, grepl looks for a pattern in each element of a vector and returns TRUE where the pattern is found and FALSE where it is not. I created a new variable: if the pattern exists, I coded the new variable as 1; if it doesn't, I coded it as 0. Here is what it would look like for one response option.

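data$V1.1 <- NULL   # drop any existing V1.1 column first
data$V1.1 <- 0      # start with 0 ("not selected") for every row
data$V1.1[grepl("1", data$V1)] <- 1   # set 1 where response option 1 appears
table(data$V1.1, useNA = "ifany")     # check the counts, including any NAs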

This code can then be repeated for the remaining four response options. If there are only a few response options, this code should work fine, but if there are many, you might want to set up a loop.
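A loop along those lines might look like the sketch below. It is only one way to do it and makes a few assumptions that are not in the answer above: the response options are the integers 1 to 5, the new columns are named V1.1 to V1.5, and the pattern is wrapped in word boundaries (\\b with perl = TRUE) so that, for example, option 1 would not also match a hypothetical option 10.

```r
# Example data from the question
data <- data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                          "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                   stringsAsFactors = FALSE)

choices <- 1:5  # assumed set of response options

for (k in choices) {
  # Word boundaries keep "1" from matching inside a longer number such as "10"
  pattern <- paste0("\\b", k, "\\b")
  # grepl returns TRUE/FALSE per row; as.integer turns that into 1/0
  data[[paste0("V1.", k)]] <- as.integer(grepl(pattern, data$V1, perl = TRUE))
}

data
```

With only single-digit options the word boundaries make no difference, so a plain pattern would give the same result here.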

A5C1D2H2I1M1N2O1R2T1 · Answer

A long time later, I finally got around to creating a package ("splitstackshape") that deals with this kind of data in an efficient manner. So, for the convenience of others (and some self-promotion, of course) here's a compact solution.

The relevant function for this problem is cSplit_e.

First, the default settings, which retain the original column and use NA as the fill:

library(splitstackshape)
cSplit_e(data, "V1")
#           V1 V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1, 2, 3    1    1    1   NA   NA
# 2    1, 2, 4    1    1   NA    1   NA
# 3 2, 3, 4, 5   NA    1    1    1    1
# 4    1, 3, 4    1   NA    1    1   NA
# 5    1, 3, 5    1   NA    1   NA    1
# 6 2, 3, 4, 5   NA    1    1    1    1

Second, dropping the original column and using 0 as the fill:

cSplit_e(data, "V1", drop = TRUE, fill = 0)
#   V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1    1    1    0    0
# 2    1    1    0    1    0
# 3    0    1    1    1    1
# 4    1    0    1    1    0
# 5    1    0    1    0    1
# 6    0    1    1    1    1

csgillespie · Answer

You just need to write a function and use apply. First some dummy data:

##Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5", 
                         "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"), 
                     stringsAsFactors=FALSE)

Next, create a function that takes in a row and transforms it as necessary:

make_row = function(i, ncol=5) {
  ##Could make the default NA if needed
  m = numeric(ncol)
  v = as.numeric(strsplit(i, ",")[[1]])
  m[v] = 1
  return(m)
}

Then use apply and transpose the result

t(apply(dd, 1, make_row))
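If you want the result as a data frame with named columns, a possible variant is sketched here. It swaps apply for vapply over the single column, which is a substitution rather than part of the answer above, and the column names V1.1 to V1.5 are an assumption taken from the layout in the question.

```r
dd <- data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                        "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                 stringsAsFactors = FALSE)

make_row <- function(i, ncol = 5) {
  m <- numeric(ncol)                      # start with all zeros
  v <- as.numeric(strsplit(i, ",")[[1]])  # "1, 2, 3" -> c(1, 2, 3)
  m[v] <- 1                               # flag the selected options
  m
}

# vapply checks that every result is a numeric vector of length 5;
# it returns a 5 x 6 matrix, so transpose to get one row per observation
out <- t(vapply(dd$V1, make_row, numeric(5), USE.NAMES = FALSE))
colnames(out) <- paste0("V1.", 1:5)       # assumed column names
result <- as.data.frame(out)
result
```

As in the original function, the number of columns is fixed in advance, so any value outside 1 to 5 in the data would need handling separately.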
