I am looking for some advice on some data restructuring. I am collecting some data using Google Forms which I download as a csv file and looks something like the following:
# alpha beta option
# 6 8, 9, 10, 11 apple
# 9 6 pear
# 1 6 apple
# 3 8, 9 pear
# 3 6, 8 lime
# 3 1 apple
# 2, 4, 7, 11 9 lime
The data has two variables (alpha and beta) that each list numbers. For the majority of my data there is only one number in each variable. However, for some observations there can be two, three or even up to ten numbers. This is because these are responses gathered using the 'checkbox' option in google forms which allows multiple answers to one survey question. Also, it may be of importance to some potential solutions that google forms returns leading whitespace before each of the multiple answers.
In my real data this occurs in only a very small proportion of all observations, the above is a more condensed example. There are several other variables in the dataset. Here I am just including one called 'option' that contains factors.
What I need to do is to duplicate all observations that contain multiple numbers in either the 'alpha' or the 'beta' variable. The number of duplicated rows should be equal to the number of numbers that exist in the alpha or beta variable. Then, I need to replace the sequence of numbers in the 'alpha' or 'beta' variables with each number independently. That would result in something like the following:
# alpha beta option
# 6 8 apple
# 6 9 apple
# 6 10 apple
# 6 11 apple
# 9 6 pear
# 1 6 apple
# 3 8 pear
# 3 9 pear
# 3 6 lime
# 3 8 lime
# 3 1 apple
# 2 9 lime
# 4 9 lime
# 7 9 lime
# 11 9 lime
Here is the data that reproduces the original example data above. I have called the dataframe 'demo':
demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label =
c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L,
4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class =
"factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple",
"lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class =
"data.frame", row.names = c(NA, -7L))
OK. So I think I have written some code that in a very long-winded fashion does lead to the new dataframe I am looking for. However, it feels like there must be a more elegant and better way of doing it.
Basically, I work on the 'alpha' variable first. I first subset the observations based on whether commas exist in the variable or not. With the observations that contain commas, I then use strsplit to separate the numbers. I then count how many numbers exist for each observation and duplicate each observation by that. I then melt the split numbers into a dataframe with all the numbers in a variable named 'value'. I simply then replace the 'alpha' variable with the data in the melted 'value' variable. I then rbind this back with the data that did not contain commas. I then use this df and work on the 'beta' variable....
Here is my solution (it seems to work?):
library(reshape2)
demo$a<-grepl(",", demo$alpha)
demo.atrue <- demo[ which(demo$a=='TRUE'), ]
demo.afalse <- demo[ which(demo$a=='FALSE'), ]
demo.atrue$alpha<-as.character(demo.atrue$alpha)
temp<-strsplit(demo.atrue$alpha, ",")
temp.lengths<-lapply(temp, length)
for (i in 1:length(temp)) {
df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3]
}
temp.melt<-melt(temp)
df.expanded$alpha<-temp.melt$value
demo.afalse<-demo.afalse[c(1:3)]
demonew<-rbind(demo.afalse, df.expanded)
demonew$b<-grepl(",", demonew$beta)
demonew.btrue <- demonew[ which(demonew$b=='TRUE'), ]
demonew.bfalse <- demonew[ which(demonew$b=='FALSE'), ]
demonew.btrue$beta<-as.character(demonew.btrue$beta)
temp<-strsplit(demonew.btrue$beta, ",")
temp.lengths<-lapply(temp, length)
for (i in 1:length(temp)) {
df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3]
}
temp.melt<-melt(temp)
df.expanded1$beta<-temp.melt$value
demonew.bfalse<-demonew.bfalse[c(1:3)]
demonew1<-rbind(df.expanded1, demonew.bfalse)
demonew1 #this seems to work, but doesn't feel very efficient
As well as perhaps not being very efficient, I am not sure whether this will work in all conditions. In particular if multiple numbers exist in both the 'alpha' and 'beta' variables for the same observation. I have tested it with a few examples and it seems ok, but I am not confident with it.
Thank you for any consideration.
You can use my cSplit
function, nested twice, like this:
cSplit(cSplit(demo, "alpha", ",", "long"), "beta", ",", "long")
# alpha beta option
# 1: 6 8 apple
# 2: 6 9 apple
# 3: 6 10 apple
# 4: 6 11 apple
# 5: 9 6 pear
# 6: 1 6 apple
# 7: 3 8 pear
# 8: 3 9 pear
# 9: 3 6 lime
# 10: 3 8 lime
# 11: 3 1 apple
# 12: 2 9 lime
# 13: 4 9 lime
# 14: 7 9 lime
# 15: 11 9 lime
Some benchmarks:
More interesting sample data. 700 rows instead of 7 (still quite a small dataset)...
demo <- do.call(rbind, replicate(100, demo, FALSE))
library(data.table)
demo2 <- data.table(demo)
Functions to test...
## MrFlick's
fun1 <- function() {
do.call(rbind, with(demo, Map(expand.grid,
alpha = strsplit(alpha,", "),
beta = strsplit(beta, ", "),
option = option
)))
}
## Mine
fun2 <- function() {
cSplit(cSplit(demo2, "alpha", ",", "long"), "beta", ",", "long")
}
## thelatemail's one-liner
fun3 <- function() {
do.call(rbind,do.call(Map, c(expand.grid, lapply(demo, strsplit, ", "))))
}
The actual benchmarking...
library(microbenchmark)
microbenchmark(MF = fun1(), AM = fun2(), TH = fun3(), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# MF 785.34875 789.94924 800.11046 800.93643 813.62390 10
# AM 11.54569 11.93483 12.14181 12.31329 12.93208 10
# TH 790.46069 799.68518 803.47294 827.69520 899.11219 10
Actually this shouldn't be too bad. First, for simplicity, i'm going to convert all the columns to characters to make the later splits easier
demo[] <- lapply(demo, as.character)
Now let's do the hard work. Basically i'll split the "alpha" and "beta" columns on the ", " separator. Then i'll use expand.grid
to combine all the elements of "alpha","beta", and "option." This will take care of repeating the necessary rows and will work if both "alpha" and "beta" have multiple values. Finally, i'll re-combine all the newly generated rows into one great big data.frame. Here's the code
do.call(rbind, with(demo, Map(expand.grid,
alpha = strsplit(alpha,", "),
beta = strsplit(beta, ", "),
option = option
)))
And that's it. It will return
alpha beta option
1 6 8 apple
2 6 9 apple
3 6 10 apple
4 6 11 apple
5 9 6 pear
6 1 6 apple
7 3 8 pear
8 3 9 pear
9 3 6 lime
10 3 8 lime
11 3 1 apple
12 2 9 lime
13 4 9 lime
14 7 9 lime
15 11 9 lime
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With