Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Wrapping base R reshape for ease-of-use

Tags:

r

reshape

It is a truth universally acknowledged that R's base reshape command is speedy and powerful but has miserable syntax. I have therefore written a quick wrapper around it which I will throw into the next release of the taRifx package. Before I did that, however, I want to solicit improvements.

Here's my version, with updates from @RichieCotton:

# reshapeasy: Version of reshape with way, way better syntax
 # Written with the help of the StackOverflow R community
 # x is a data.frame to be reshaped
 # direction is "wide" or "long"
 # vars are the names of the (stubs of) the variables to be reshaped (if omitted, defaults to everything not in id or vary)
 # id are the names of the variables that identify unique observations
 # vary is the variable that varies.  Going to wide this variable will cease to exist.  Going to long it will be created.
 # omit is a vector of characters which are to be omitted if found at the end of variable names (e.g. price_1 becomes price in long)
 # ... are options to be passed to stats::reshape
reshapeasy <- function( data, direction, id=(sapply(data,is.factor) | sapply(data,is.character)), vary=sapply(data,is.numeric), omit=c("_","."), vars=NULL, ... ) {
  if(direction=="wide") data <- stats::reshape( data=data, direction=direction, idvar=id, timevar=vary, ... )
  if(direction=="long") {
    varying <- which(!(colnames(data) %in% id))
    data <- stats::reshape( data=data, direction=direction, idvar=id, varying=varying, timevar=vary, ... )
  }
  colnames(data) <- gsub( paste("[",paste(omit,collapse="",sep=""),"]$",sep=""), "", colnames(data) )
  return(data)
}

Note that you can move from wide to long without changing the options other than the direction. To me, this is the key to usability.

I'm happy to give acknowledgement in the function help files for any substantial improvements if you chat or e-mail me your info.

Improvements might fall in the following areas:

  • Naming the function and its arguments
  • Making it more general (currently it handles a fairly specific case, which I believe to be by far the most common, but it has not yet exhausted the capabilities of stats::reshape)
  • Code improvements

Examples

Sample data

x.wide <- structure(list(surveyNum = 1:6, pio_1 = structure(c(2L, 2L, 1L, 
2L, 1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), pio_2 = structure(c(2L, 1L, 2L, 1L, 
2L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), pio_3 = structure(c(2L, 2L, 1L, 1L, 
2L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), caremgmt_1 = structure(c(2L, 1L, 1L, 
2L, 1L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), caremgmt_2 = structure(c(1L, 2L, 2L, 
2L, 2L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), caremgmt_3 = structure(c(1L, 2L, 1L, 
2L, 1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), prev_1 = structure(c(1L, 2L, 2L, 1L, 
1L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), prev_2 = structure(c(2L, 2L, 1L, 2L, 
1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), prev_3 = structure(c(2L, 1L, 2L, 2L, 
1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2"), class = "factor"), price_1 = structure(c(2L, 1L, 2L, 5L, 
3L, 4L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2", "3", "4", "5", "6"), class = "factor"), price_2 = structure(c(6L, 
5L, 5L, 4L, 4L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2", "3", "4", "5", "6"), class = "factor"), price_3 = structure(c(3L, 
5L, 2L, 5L, 4L, 5L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1", 
"2", "3", "4", "5", "6"), class = "factor")), .Names = c("surveyNum", 
"pio_1", "pio_2", "pio_3", "caremgmt_1", "caremgmt_2", "caremgmt_3", 
"prev_1", "prev_2", "prev_3", "price_1", "price_2", "price_3"
), idvars = "surveyNum", rdimnames = list(structure(list(surveyNum = 1:24), .Names = "surveyNum", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"
), class = "data.frame"), structure(list(variable = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("pio", 
"caremgmt", "prev", "price"), class = "factor"), .id = c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("variable", 
".id"), row.names = c("pio_1", "pio_2", "pio_3", "caremgmt_1", 
"caremgmt_2", "caremgmt_3", "prev_1", "prev_2", "prev_3", "price_1", 
"price_2", "price_3"), class = "data.frame")), row.names = c(NA, 
6L), class = c("cast_df", "data.frame"))

x.long <- structure(list(.id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), pio = structure(c(2L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 
1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("1", "2"), class = "factor"), 
    caremgmt = structure(c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 
    2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 
    1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 
    1L, 2L, 2L), .Label = c("1", "2"), class = "factor"), prev = structure(c(1L, 
    2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
    1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 
    2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 
    2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("1", 
    "2"), class = "factor"), price = structure(c(2L, 1L, 2L, 
    5L, 3L, 4L, 1L, 5L, 4L, 3L, 1L, 2L, 6L, 6L, 5L, 4L, 6L, 3L, 
    5L, 6L, 3L, 1L, 2L, 4L, 3L, 5L, 2L, 5L, 4L, 5L, 6L, 6L, 4L, 
    6L, 4L, 1L, 2L, 3L, 1L, 2L, 2L, 5L, 1L, 6L, 1L, 3L, 4L, 3L, 
    6L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 6L, 3L, 1L, 4L, 4L, 5L, 1L, 
    3L, 6L, 1L, 3L, 5L, 1L, 3L, 6L, 2L), .Label = c("1", "2", 
    "3", "4", "5", "6"), class = "factor"), surveyNum = c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
    15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 1L, 2L, 
    3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
    16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 1L, 2L, 3L, 
    4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
    17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L)), .Names = c(".id", 
"pio", "caremgmt", "prev", "price", "surveyNum"), row.names = c(NA, 
-72L), class = "data.frame")

Examples

> x.wide
  surveyNum pio_1 pio_2 pio_3 caremgmt_1 caremgmt_2 caremgmt_3 prev_1 prev_2 prev_3 price_1 price_2 price_3
1         1     2     2     2          2          1          1      1      2      2       2       6       3
2         2     2     1     2          1          2          2      2      2      1       1       5       5
3         3     1     2     1          1          2          1      2      1      2       2       5       2
4         4     2     1     1          2          2          2      1      2      2       5       4       5
5         5     1     2     2          1          2          1      1      1      1       3       4       4
6         6     1     2     1          2          1          1      2      1      1       4       2       5
> reshapeasy( x.wide, "long", NULL, id="surveyNum", vary="id", sep="_" )
    surveyNum id pio caremgmt prev price
1.1         1  1   2        2    1     2
2.1         2  1   2        1    2     1
3.1         3  1   1        1    2     2
4.1         4  1   2        2    1     5
5.1         5  1   1        1    1     3
6.1         6  1   1        2    2     4
1.2         1  2   2        1    2     6
2.2         2  2   1        2    2     5
3.2         3  2   2        2    1     5
4.2         4  2   1        2    2     4
5.2         5  2   2        2    1     4
6.2         6  2   2        1    1     2
1.3         1  3   2        1    2     3
2.3         2  3   2        2    1     5
3.3         3  3   1        1    2     2
4.3         4  3   1        2    2     5
5.3         5  3   2        1    1     4
6.3         6  3   1        1    1     5

> head(x.long)
  .id pio caremgmt prev price surveyNum
1   1   2        2    1     2         1
2   1   2        1    2     1         2
3   1   1        1    2     2         3
4   1   2        2    1     5         4
5   1   1        1    1     3         5
6   1   1        2    2     4         6

> head(reshapeasy( x.long, direction="wide", id="surveyNum", vary=".id" ))
  surveyNum pio.1 caremgmt.1 prev.1 price.1 pio.3 caremgmt.3 prev.3 price.3 pio.2 caremgmt.2 prev.2 price.2
1         1     2          2      1       2     2          1      2       3     2          1      2       6
2         2     2          1      2       1     2          2      1       5     1          2      2       5
3         3     1          1      2       2     1          1      2       2     2          2      1       5
4         4     2          2      1       5     1          2      2       5     1          2      2       4
5         5     1          1      1       3     2          1      1       4     2          2      1       4
6         6     1          2      2       4     1          1      1       5     2          1      1       2
like image 677
Ari B. Friedman Avatar asked Apr 07 '12 15:04

Ari B. Friedman


2 Answers

Perhaps for those who are lazy and don't like to type the variable names, you can add the following to the head of your function:

  if (is.numeric(id) == 1) {
    id = colnames(data)[id]
  } else if (is.numeric(id) == 0) {
    id = id
  }

  if (is.numeric(vary) == 1) {
    vary = colnames(data)[vary]
  } else if (is.numeric(vary) == 0) {
    vary = vary
  }

Then, following with your examples, you can use the following shorthand:

reshapeasy(x.wide, direction="long", id=1, sep="_", vary="id")
reshapeasy(x.long, direction="wide", id=6, vary=1)

(I know, it might not be good practice since the code might be less readable or less easily understandable by someone later on, but it does happen frequently.)

like image 67
A5C1D2H2I1M1N2O1R2T1 Avatar answered Dec 13 '22 05:12

A5C1D2H2I1M1N2O1R2T1


Some initial thoughts:

I've always thought that the direction commands "wide" and "long" were a little fuzzy. Do they mean you want to convert the data to that format, or that the data is already in that format? It is something that you need to learn or look up. You can avoid that problem by having to separate functions reshapeToWide and reshapeToLong. As a bonus, the signature of each function has one less argument.


I don't think you meant to include the line

varying <- which(!(colnames(x.wide) %in% "surveyNum"))

since it refers to a specific dataset.


I prefer data to x for the first argument since it makes it clear that the input should be a data frame.


It is generally better form to have arguments without defaults first. So vars should come after id and vary.


Can you pick defaults for id and vary? reshape::melt defaults to factor and character columns for id and numeric columns for vary.

like image 21
Richie Cotton Avatar answered Dec 13 '22 05:12

Richie Cotton