I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to reflect the fact that the survey question was asked in a particular month.
I wish to create a new set of variables that have month-invariant names; the value of these variables will correspond to the value of a month-variant question for the month observed.
Please see an example / fictitious data set:
require(data.table)
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))
In this survey, there are really only two questions: "q1" and "q2". Each of these questions is repeatedly asked for several months. However, the observation contains a valid response only if the month observed in the data matches up with the survey question for a particular month.
For example: "may.q1" is observed as "yes" for any observation in "May". I would like a new "Q1" variable to represent "may.q1", "jun.q1", and "jul.q1". The value of "Q1" will take on the value of "may.q1" when the month is "may", and the value of "Q1" will take on the value of "jun.q1" when the month is "jun".
If I were to try to do this by hand using data.table, I would want something like:
mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = FALSE]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))
I would want this repeated "by = month".
If I were to use the "plyr" package for a data frame, I would solve using the following approach:
require(plyr)
data <- data.frame(data)
mdata <- ddply(data, .(month), function(dfmo) {
dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
return(dfmo)
})
Any help using a data.table method would be greatly appreciated, as my data are large. Thank you.
A different way to illustrate:
data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]
month may.q1 may.q2
1: may yes econ
2: may yes econ
3: may yes econ
4: may yes econ
5: may yes econ
6: jun lunch foggy
7: jun lunch foggy
8: jun lunch foggy
9: jun lunch foggy
10: jun lunch foggy
11: jul oranges heavy rain
12: jul oranges heavy rain
13: jul oranges heavy rain
14: jul oranges heavy rain
15: jul oranges heavy rain
But note that the column names come from the first group (you can rename them afterwards using setnames). This may also not be the most efficient approach if there are a great number of columns and only a few are needed. In that case, Arun's solution of melting to long format should be faster.
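As a minimal sketch of the renaming step mentioned above, the snippet below rebuilds the fictitious data from the question, runs the same grouped expression, and then renames the result in place. The target names "q1" and "q2" are an assumption about what the month-invariant names should be.

```r
library(data.table)

# Same fictitious data as in the question.
data <- data.table(
  month  = rep(c("may", "jun", "jul"), each = 5),
  may.q1 = rep(c("yes", "no", "yes"), each = 5),
  jun.q1 = rep(c("breakfast", "lunch", "dinner"), each = 5),
  jul.q1 = rep(c("oranges", "apples", "oranges"), each = 5),
  may.q2 = rep(c("econ", "math", "science"), each = 5),
  jun.q2 = rep(c("sunny", "foggy", "cloudy"), each = 5),
  jul.q2 = rep(c("no rain", "light mist", "heavy rain"), each = 5)
)

# Within each month, keep only that month's two question columns.
res <- data[, .SD[, paste0(month, c(".q1", ".q2")), with = FALSE], by = month]

# The result inherits "may.q1"/"may.q2" from the first group;
# replace all column names by reference (assumed target names).
setnames(res, c("month", "q1", "q2"))
```

Calling setnames with a single character vector replaces every column name, so it does not depend on which month happens to be the first group.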
Edit: Seems very inefficient on bigger data. Check out @MatthewDowle's answer for a really fast and neat solution.
Here's a solution using data.table.
dd <- melt.dt(data, id.var=c("month"))[month == gsub("\\..*$", "", ind)][,
ind := gsub("^.*\\.", "", ind)][, split(values, ind), by=list(month)]
melt.dt is a small function I wrote (there are still improvements to be made) to melt a data.table, similar to the melt function in reshape2 (copy/paste the function shown below before trying out the code above).
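melt.dt <- function(DT, id.var) {
  stopifnot(inherits(DT, "data.table"))
  # every column that is not an id column gets melted
  measure.var <- setdiff(names(DT), id.var)
  # name of the source column, repeated once per row of DT
  ind <- rep.int(measure.var, rep.int(nrow(DT), length(measure.var)))
  # build the call list(<id.var>, ind = factor(ind), values = c(<measure.var>))
  m1 <- lapply(c("list", id.var), as.name)
  m2 <- as.call(lapply(c("factor", "ind"), as.name))
  m3 <- as.call(lapply(c("c", measure.var), as.name))
  quoted <- as.call(c(m1, ind = m2, values = m3))
  # evaluate the call with DT's columns in scope
  DT[, eval(quoted)]
}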
The idea: first melt the data.table with id.var = month. Now all the entries in the melted "ind" column are of the form month.question. So, by removing ".question" from this column and comparing the result with the month column, we can drop all the unnecessary rows. Once that is done, we no longer need the "month." prefix in "ind", so we use gsub to strip it and keep just q1, q2, etc. After this, we have to reshape (or cast) the data. This is done by grouping by month and splitting the values column by ind (which is either q1 or q2). So you get two columns for every month, which are then stitched together to give the desired output.
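The same melt, filter, and cast idea can also be sketched with the melt and dcast methods that more recent versions of data.table provide, so no hand-written helper is needed. This is only an illustration under assumptions: the row identifier id and the column names asked and question are hypothetical helpers introduced here, not part of the original data.

```r
library(data.table)

# Same fictitious data as in the question.
data <- data.table(
  month  = rep(c("may", "jun", "jul"), each = 5),
  may.q1 = rep(c("yes", "no", "yes"), each = 5),
  jun.q1 = rep(c("breakfast", "lunch", "dinner"), each = 5),
  jul.q1 = rep(c("oranges", "apples", "oranges"), each = 5),
  may.q2 = rep(c("econ", "math", "science"), each = 5),
  jun.q2 = rep(c("sunny", "foggy", "cloudy"), each = 5),
  jul.q2 = rep(c("no rain", "light mist", "heavy rain"), each = 5)
)

# Hypothetical row id so that each observation can be cast back to one row.
data[, id := .I]

# Melt: one row per (observation, original column).
long <- melt(data, id.vars = c("id", "month"), variable.factor = FALSE)

# Split "may.q1" into the month it was asked in and the question name.
long[, c("asked", "question") := tstrsplit(variable, ".", fixed = TRUE)]

# Keep only answers from the observed month, then cast back to wide form.
wide <- dcast(long[month == asked], id + month ~ question, value.var = "value")
```

Here wide has one row per original observation with columns id, month, q1 and q2. Whether this is faster than the grouped .SD approach on large data would need to be measured.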