
Split a column of concatenated comma-delimited data and recode output as factors

Tags:

split

r

I am trying to clean up some data that has been incorrectly entered. The question for the variable allows for multiple responses out of five choices, numbered as 1 to 5. The data has been entered in the following manner (this is just an example--there are many more variables and many more observations in the actual data frame):

data
          V1
1    1, 2, 3
2    1, 2, 4
3 2, 3, 4, 5
4    1, 3, 4
5    1, 3, 5
6 2, 3, 4, 5

Here's some code to recreate that example data:

data = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5", 
                         "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"))

What I actually need is the data to be treated more... binary--like a set of "yes/no" questions--entered in a data frame that looks more like:

data
    V1.1  V1.2  V1.3  V1.4  V1.5
1      1     1     1    NA    NA
2      1     1    NA     1    NA
3     NA     1     1     1     1
4      1    NA     1     1    NA
5      1    NA     1    NA     1
6     NA     1     1     1     1

The actual variable names don't matter at the moment--I can easily fix that. Also, it doesn't matter too much whether the missing elements are "0", "NA", or blank--again, that's something I can fix later.

I've tried using the transform function from the reshape package as well as a few different things with strsplit, but I can't get either to do what I am looking for. I've also looked at many other related questions on Stack Overflow, but they don't seem to be quite the same problem.

A5C1D2H2I1M1N2O1R2T1 Avatar asked Apr 11 '12 06:04



3 Answers

This is my first time answering a question on stackoverflow. Please let me know if this makes sense.

I had this problem when I was working with some Qualtrics data, and I used grepl to solve it. Here is a link to the R documentation:
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/grep

As I understand it, grepl looks for a pattern in each element of a vector and returns TRUE where the pattern is found and FALSE where it is not. I created a new variable and coded it as 1 where the pattern exists and 0 where it doesn't. Here is what that looks like for one response option:

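data$V1.1 <- NULL                      # drop any existing copy of the column
data$V1.1 <- 0                         # default: option not selected
data$V1.1[grepl("1", data$V1)] <- 1    # 1 where option 1 appears in V1
table(data$V1.1, useNA = "ifany")      # check the counts, including any NAs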

This code can then be repeated for the remaining four response options. If there are only a few response options, repeating it by hand should work fine. If there are a lot of them, you might want to set up a loop.
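
As a rough sketch of what such a loop could look like, assuming the response options are the integers 1 to 5 and the column is named V1 as in the question (the `choices` vector and the `V1.k` column names are only illustrative):

```r
# Example data from the question
data <- data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                          "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                   stringsAsFactors = FALSE)

# Assumption: the possible responses are the integers 1 to 5
choices <- 1:5

for (k in choices) {
  # \\b is a word boundary, so option 1 does not also match "10" or "21"
  hit <- grepl(paste0("\\b", k, "\\b"), data$V1, perl = TRUE)
  # One indicator column per option: 1 if selected, 0 otherwise
  data[[paste0("V1.", k)]] <- as.integer(hit)
}

data
```

The word-boundary pattern is not needed for single-digit options, but it guards against false matches if the options ever run past 9.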

Eric Boorman Avatar answered Sep 29 '22 12:09



A long time later, I finally got around to creating a package ("splitstackshape") that deals with this kind of data in an efficient manner. So, for the convenience of others (and some self-promotion, of course) here's a compact solution.

The relevant function for this problem is cSplit_e.

First, the default settings, which retain the original column and use NA as the fill:

library(splitstackshape)
cSplit_e(data, "V1")
#           V1 V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1, 2, 3    1    1    1   NA   NA
# 2    1, 2, 4    1    1   NA    1   NA
# 3 2, 3, 4, 5   NA    1    1    1    1
# 4    1, 3, 4    1   NA    1    1   NA
# 5    1, 3, 5    1   NA    1   NA    1
# 6 2, 3, 4, 5   NA    1    1    1    1

Second, dropping the original column and using 0 as the fill:

cSplit_e(data, "V1", drop = TRUE, fill = 0)
#   V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1    1    1    0    0
# 2    1    1    0    1    0
# 3    0    1    1    1    1
# 4    1    0    1    1    0
# 5    1    0    1    0    1
# 6    0    1    1    1    1
A5C1D2H2I1M1N2O1R2T1 Avatar answered Sep 29 '22 12:09



You just need to write a function and use apply. First some dummy data:

## Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5", 
                       "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"), 
                stringsAsFactors = FALSE)

Next, create a function that takes in a row and transforms it as necessary:

make_row = function(i, ncol=5) {
  ##Could make the default NA if needed
  m = numeric(ncol)
  v = as.numeric(strsplit(i, ",")[[1]])
  m[v] = 1
  return(m)
}

Then use apply and transpose the result:

t(apply(dd, 1, make_row))
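
If NA is preferred as the fill and the result should be a data frame with named columns, one possible variant is sketched below. The name `make_row_fill`, its `fill` argument, and the `V1.k` column names are assumptions made for illustration, and vapply is swapped in for apply so that only the one column is processed:

```r
dd <- data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                        "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                 stringsAsFactors = FALSE)

# Hypothetical variant of make_row with a configurable fill value
make_row_fill <- function(x, ncol = 5, fill = NA_real_) {
  m <- rep(fill, ncol)                     # start with every option unselected
  v <- as.numeric(strsplit(x, ",")[[1]])   # as.numeric tolerates the leading spaces
  m[v] <- 1                                # mark the selected options
  m
}

# vapply returns one column per input element, so transpose to one row each
res <- t(vapply(dd$V1, make_row_fill, numeric(5), USE.NAMES = FALSE))
colnames(res) <- paste0("V1.", 1:5)
res <- as.data.frame(res)
res
```

Passing `fill = 0` through vapply, for example `vapply(dd$V1, make_row_fill, numeric(5), fill = 0, USE.NAMES = FALSE)`, gives the 0/1 coding instead.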
csgillespie Avatar answered Sep 29 '22 13:09
