Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Duplicating observations of a dataframe, but also replacing specific variable values in R

I am looking for some advice on some data restructuring. I am collecting some data using Google Forms which I download as a csv file and looks something like the following:

# alpha                 beta    option
#  6             8, 9, 10, 11    apple
#  9                        6     pear
#  1                        6    apple
#  3                     8, 9     pear
#  3                     6, 8     lime
#  3                        1    apple
#  2, 4, 7, 11              9     lime

The data has two variables (alpha and beta) that each list numbers. For the majority of my data there is only one number in each variable. However, for some observations there can be two, three or even up to ten numbers. This is because these are responses gathered using the 'checkbox' option in google forms which allows multiple answers to one survey question. Also, it may be of importance to some potential solutions that google forms returns leading whitespace before each of the multiple answers.

In my real data this occurs in only a very small proportion of all observations, the above is a more condensed example. There are several other variables in the dataset. Here I am just including one called 'option' that contains factors.

What I need to do is to duplicate all observations that contain multiple numbers in either the 'alpha' or the 'beta' variable. The number of duplicated rows should be equal to the number of numbers that exist in the alpha or beta variable. Then, I need to replace the sequence of numbers in the 'alpha' or 'beta' variables with each number independently. That would result in something like the following:

#  alpha  beta   option
#     6    8     apple
#     6    9     apple
#     6   10     apple
#     6   11     apple
#     9    6      pear
#     1    6     apple
#     3    8      pear
#     3    9      pear
#     3    6      lime
#     3    8      lime
#     3    1     apple
#     2    9      lime
#     4    9      lime
#     7    9      lime
#    11    9      lime

Here is the data that reproduces the original example data above. I have called the dataframe 'demo':

demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label =
 c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L, 
4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class =   
"factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple", 
"lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class =   
"data.frame", row.names = c(NA, -7L))

OK. So I think I have written some code that in a very long-winded fashion does lead to the new dataframe I am looking for. However, it feels like there must be a more elegant and better way of doing it.

Basically, I work on the 'alpha' variable first. I first subset the observations based on whether commas exist in the variable or not. With the observations that contain commas, I then use strsplit to separate the numbers. I then count how many numbers exist for each observation and duplicate each observation by that. I then melt the split numbers into a dataframe with all the numbers in a variable named 'value'. I simply then replace the 'alpha' variable with the data in the melted 'value' variable. I then rbind this back with the data that did not contain commas. I then use this df and work on the 'beta' variable....

Here is my solution (it seems to work?):

library(reshape2)

demo$a<-grepl(",", demo$alpha)
demo.atrue <- demo[ which(demo$a=='TRUE'), ]
demo.afalse <- demo[ which(demo$a=='FALSE'), ]
demo.atrue$alpha<-as.character(demo.atrue$alpha)
temp<-strsplit(demo.atrue$alpha, ",")
temp.lengths<-lapply(temp, length)

for (i in 1:length(temp)) { 
df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3]
}

temp.melt<-melt(temp)
df.expanded$alpha<-temp.melt$value
demo.afalse<-demo.afalse[c(1:3)]
demonew<-rbind(demo.afalse, df.expanded)



demonew$b<-grepl(",", demonew$beta)
demonew.btrue <- demonew[ which(demonew$b=='TRUE'), ]
demonew.bfalse <- demonew[ which(demonew$b=='FALSE'), ]
demonew.btrue$beta<-as.character(demonew.btrue$beta)

temp<-strsplit(demonew.btrue$beta, ",")
temp.lengths<-lapply(temp, length)

for (i in 1:length(temp)) { 
  df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3]
}

temp.melt<-melt(temp)
df.expanded1$beta<-temp.melt$value
demonew.bfalse<-demonew.bfalse[c(1:3)]
demonew1<-rbind(df.expanded1, demonew.bfalse)

demonew1  #this seems to work, but doesn't feel very efficient

As well as perhaps not being very efficient, I am not sure whether this will work in all conditions. In particular if multiple numbers exist in both the 'alpha' and 'beta' variables for the same observation. I have tested it with a few examples and it seems ok, but I am not confident with it.

Thank you for any consideration.

like image 715
jalapic Avatar asked Jun 27 '14 01:06

jalapic


2 Answers

You can use my cSplit function, nested twice, like this:

cSplit(cSplit(demo, "alpha", ",", "long"), "beta", ",", "long")
#     alpha beta option
#  1:     6    8  apple
#  2:     6    9  apple
#  3:     6   10  apple
#  4:     6   11  apple
#  5:     9    6   pear
#  6:     1    6  apple
#  7:     3    8   pear
#  8:     3    9   pear
#  9:     3    6   lime
# 10:     3    8   lime
# 11:     3    1  apple
# 12:     2    9   lime
# 13:     4    9   lime
# 14:     7    9   lime
# 15:    11    9   lime

Some benchmarks:

More interesting sample data. 700 rows instead of 7 (still quite a small dataset)...

demo <- do.call(rbind, replicate(100, demo, FALSE))
library(data.table)
demo2 <- data.table(demo)

Functions to test...

## MrFlick's
fun1 <- function() {
  do.call(rbind, with(demo, Map(expand.grid,
                                alpha = strsplit(alpha,", "),
                                beta = strsplit(beta, ", "),
                                option = option
  )))
} 

## Mine
fun2 <-  function() {
  cSplit(cSplit(demo2, "alpha", ",", "long"), "beta", ",", "long")
} 

## thelatemail's one-liner
fun3 <- function() {
  do.call(rbind,do.call(Map, c(expand.grid, lapply(demo, strsplit, ", "))))
} 

The actual benchmarking...

library(microbenchmark)
microbenchmark(MF = fun1(), AM = fun2(), TH = fun3(), times = 10)
# Unit: milliseconds
#  expr       min        lq    median        uq       max neval
#    MF 785.34875 789.94924 800.11046 800.93643 813.62390    10
#    AM  11.54569  11.93483  12.14181  12.31329  12.93208    10
#    TH 790.46069 799.68518 803.47294 827.69520 899.11219    10
like image 80
A5C1D2H2I1M1N2O1R2T1 Avatar answered Sep 23 '22 12:09

A5C1D2H2I1M1N2O1R2T1


Actually this shouldn't be too bad. First, for simplicity, i'm going to convert all the columns to characters to make the later splits easier

demo[] <- lapply(demo, as.character)

Now let's do the hard work. Basically i'll split the "alpha" and "beta" columns on the ", " separator. Then i'll use expand.grid to combine all the elements of "alpha","beta", and "option." This will take care of repeating the necessary rows and will work if both "alpha" and "beta" have multiple values. Finally, i'll re-combine all the newly generated rows into one great big data.frame. Here's the code

do.call(rbind, with(demo, Map(expand.grid, 
     alpha = strsplit(alpha,", "), 
     beta = strsplit(beta, ", "), 
     option = option
)))

And that's it. It will return

   alpha beta option
1      6    8  apple
2      6    9  apple
3      6   10  apple
4      6   11  apple
5      9    6   pear
6      1    6  apple
7      3    8   pear
8      3    9   pear
9      3    6   lime
10     3    8   lime
11     3    1  apple
12     2    9   lime
13     4    9   lime
14     7    9   lime
15    11    9   lime
like image 38
MrFlick Avatar answered Sep 24 '22 12:09

MrFlick