
Performing Operations on a Subset Using Data Table

I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data, one for each month in which the survey question was asked.

I wish to create a new set of variables that have month-invariant names; the value of these variables will correspond to the value of a month-variant question for the month observed.

Please see an example / fictitious data set:

require(data.table)

data <- data.table(month = rep(c('may', 'jun', 'jul'),  each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'),  each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'),  each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'),  each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

In this survey, there are really only two questions: "q1" and "q2". Each of these questions is repeatedly asked for several months. However, the observation contains a valid response only if the month observed in the data matches up with the survey question for a particular month.

For example: "may.q1" is observed as "yes" for any observation in "May". I would like a new "Q1" variable to represent "may.q1", "jun.q1", and "jul.q1". The value of "Q1" will take on the value of "may.q1" when the month is "may", and the value of "Q1" will take on the value of "jun.q1" when the month is "jun".

If I were to try and do this by hand using data table, I would want something like:

mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = FALSE]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))

I would want this repeated "by = month".

If I were to use the "plyr" package for a data frame, I would solve using the following approach:

require(plyr)
data <- data.frame(data)

mdata <- ddply(data, .(month), function(dfmo) {
    dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
    names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
    return(dfmo)
})

Any help using a data.table method would be greatly appreciated, as my data are large. Thank you.

Andreas asked Apr 22 '13 18:04



2 Answers

A different way to illustrate:

data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]

    month  may.q1     may.q2
 1:   may     yes       econ
 2:   may     yes       econ
 3:   may     yes       econ
 4:   may     yes       econ
 5:   may     yes       econ
 6:   jun   lunch      foggy
 7:   jun   lunch      foggy
 8:   jun   lunch      foggy
 9:   jun   lunch      foggy
10:   jun   lunch      foggy
11:   jul oranges heavy rain
12:   jul oranges heavy rain
13:   jul oranges heavy rain
14:   jul oranges heavy rain
15:   jul oranges heavy rain

Note that the column names come from the first group, so rename them afterwards using setnames. This approach may also be inefficient when there are a great number of columns and only a few are needed. In that case, Arun's solution of melting to long format should be faster.
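As a minimal sketch of the renaming step, assuming the example data from the question and that "may" is the first group (so the result columns are named may.q1 and may.q2), the call could look like this. The name res is only an illustrative choice.

```r
library(data.table)

# Example data from the question
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

# For each month, keep only that month's q1 and q2 columns
res <- data[, .SD[, paste0(month, c(".q1", ".q2")), with = FALSE], by = month]

# The result carries the first group's column names, so rename by reference
setnames(res, c("may.q1", "may.q2"), c("q1", "q2"))
```

setnames changes the names in place, so no copy of res is made.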

Matt Dowle answered Sep 19 '22 15:09


Edit: Seems very inefficient on bigger data. Check out @MatthewDowle's answer for a really fast and neat solution.

Here's a solution using data.table.

dd <- melt.dt(data, id.var=c("month"))[month == gsub("\\..*$", "", ind)][, 
        ind := gsub("^.*\\.", "", ind)][, split(values, ind), by=list(month)]

The function melt.dt is a small function I wrote (there are still improvements to be made) that melts a data.table, much like the melt function in the reshape2 package. Copy and paste the function shown below before trying out the code above.

melt.dt <- function(DT, id.var) {
    stopifnot(inherits(DT, "data.table"))
    measure.var <- setdiff(names(DT), id.var)
    ind <- rep.int(measure.var, rep.int(nrow(DT), length(measure.var)))
    m1  <- lapply(c("list", id.var), as.name)
    m2  <- as.call(lapply(c("factor", "ind"), as.name))
    m3  <- as.call(lapply(c("c", measure.var), as.name))    
    quoted <- as.call(c(m1, ind = m2, values = m3))
    DT[, eval(quoted)]
}

The idea: first melt the data.table with id.var = "month". All values in the melted column "ind" are then of the form month.question. By stripping ".question" from "ind" and comparing the result with the month column, we can drop every row whose month does not match. Once that is done, we no longer need the "month." prefix in "ind", so we use gsub to remove it and keep just q1, q2, etc. After this, we have to reshape (or cast) the data. This is done by grouping by month and splitting the values column by ind (which is either q1 or q2). So you get two columns for every month, which are then stitched together to give your desired output.
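The same melt, filter, and cast idea can also be sketched with the reshaping functions that later data.table releases ship themselves (melt, tstrsplit, dcast), instead of the hand-written melt.dt. This is only a sketch under that assumption. The helper columns id, mon, and question are hypothetical names introduced here, and the row id is added so that the cast can tell identical rows apart.

```r
library(data.table)

# Example data from the question
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

# Hypothetical row id so each observation stays distinct after melting
data[, id := .I]

# Melt to long form: one row per (observation, month.question) pair
long <- melt(data, id.vars = c("id", "month"),
             variable.name = "ind", value.name = "values",
             variable.factor = FALSE)

# Split "may.q1" into its month part and its question part
long[, c("mon", "question") := tstrsplit(ind, ".", fixed = TRUE)]

# Keep only the rows where the question's month matches the observed month
long <- long[month == mon]

# Cast back to wide form: one column per question
wide <- dcast(long, id + month ~ question, value.var = "values")
```

Here wide has one row per original observation, with columns id, month, q1, and q2.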

Arun answered Sep 18 '22 15:09