 

Using reshape() with multiple timevar variables?

Tags:

r

reshape

So here is what I am trying to do. I have a data set with all outcomes listed in one column, while the step at which each outcome was observed and the method used to observe it are in separate columns. There are multiple sites, and I am treating site as the unique identifier. Not every site has the same number of steps or methods, and not every method is done at every step. For example, site a1 may have steps s1-s5 with methods m1-m25 at each step, while site a9 may have steps s1-s15 but only methods m3-m9. In other words, data can be missing for any given site/step/method combination. The raw data set looks a little like this:

site step   method  outcome
a1   S 1    m1      5
a1   S 1    m2      1
a1   S 2    m6      4
a2   S 1    m6      1a
a2   S 1    m4      3
a2   S 3    m7      2
a2   S 4    m2      7
a3   S 1    m1      2a
a3   S 1    m2      c11
a4   S 1    m4      2
a4   S 2    m2      5
a5   S 3    m3      6
a6   S 2    m1      7   
a6   S 3    m4      8   

outcome has both numeric and character values, depending on the method.

Step is the only real "time" variable, but I feel that I need R to treat method as one as well. As it stands, the data has lots of rows and just these few columns, and I am having trouble running any analysis on it.

I used reshape as follows (I have tried other reshape calls, but this is one example). Unfortunately, it won't let me use two time variables:

    mydata<-reshape(rawdata,idvar="site",timevar="step",direction="wide")


  site method.S 1 outcome.S 1 method.S 2 outcome.S 2 method.S 3 outcome.S 3
1    a1         m1           5         m6           4       <NA>          NA
4    a2         m6           1       <NA>          NA         m7           2
8    a3         m1           2       <NA>          NA       <NA>          NA
10   a4         m4           2         m2           5       <NA>          NA
12   a5       <NA>          NA       <NA>          NA         m3           6
13   a6       <NA>          NA         m1           7         m4           8
   method.S 4 outcome.S 4
1        <NA>          NA
4          m2           7
8        <NA>          NA
10       <NA>          NA
12       <NA>          NA
13       <NA>          NA

This is the output from R.

It is correct that I want to end up with only one row per site and many columns (even if a site had nothing done at a particular step). What I am trying to get is one row per site, with the outcome column gone and each of its values placed beneath the appropriate column, like so:

site  S1.m1.outcome S1.m2.outcome S1.m3.outcome ................ S9.m10.outcome
a1        1               c4.5           NA                         3.6

So basically, one column per step and method combination. I know that is a lot of columns, but it will make it much easier to compare between steps, which is one of my goals. My main reason for doing this is to be able to test, for a given method, changes in outcomes between steps, using t-tests and the like for differences in means. I imagine there is an easier way to go about doing the tests, but I am still new to R and haven't found one yet. Thanks for any advice. Cheers.

user2117897 asked Feb 28 '13 04:02


2 Answers

Here are two options, if I understand your desired output correctly. In the examples below, the last steps sort the columns so that the two outputs can be compared, and I show only the first few and last few columns of the resulting data.frames so you can see that both options give the same results. In other words, you should be able to stop at "T2" and "T3" for your actual needs; the rest is just for demonstration.

Option 1: reshape twice

T1 <- reshape(rawdata, idvar = c("site", "method"),
              timevar = "step", direction = "wide")
T2 <- reshape(T1, direction = "wide", idvar = "site", timevar = "method")
dim(T2)
# [1]  6 25
names(T2)
#  [1] "site"          "outcome.S1.m1" "outcome.S2.m1" "outcome.S3.m1" "outcome.S4.m1"
#  [6] "outcome.S1.m2" "outcome.S2.m2" "outcome.S3.m2" "outcome.S4.m2" "outcome.S1.m6"
# [11] "outcome.S2.m6" "outcome.S3.m6" "outcome.S4.m6" "outcome.S1.m4" "outcome.S2.m4"
# [16] "outcome.S3.m4" "outcome.S4.m4" "outcome.S1.m7" "outcome.S2.m7" "outcome.S3.m7"
# [21] "outcome.S4.m7" "outcome.S1.m3" "outcome.S2.m3" "outcome.S3.m3" "outcome.S4.m3"
T2a <- T2[, order(names(T2))]
T2a[, c(1:3, 23:25)]
#    outcome.S1.m1 outcome.S1.m2 outcome.S1.m3 outcome.S4.m6 outcome.S4.m7 site
# 1              5             1          <NA>          <NA>          <NA>   a1
# 4           <NA>          <NA>          <NA>          <NA>          <NA>   a2
# 8             2a           c11          <NA>          <NA>          <NA>   a3
# 10          <NA>          <NA>          <NA>          <NA>          <NA>   a4
# 12          <NA>          <NA>          <NA>          <NA>          <NA>   a5
# 13          <NA>          <NA>          <NA>          <NA>          <NA>   a6

Option 2: Use dcast from "reshape2"

library(reshape2)
T3 <- dcast(rawdata, site ~ step + method, value.var = "outcome", drop = FALSE)
dim(T3)
# [1]  6 25
names(T3)
#  [1] "site"  "S1_m1" "S1_m2" "S1_m3" "S1_m4" "S1_m6" "S1_m7" "S2_m1" "S2_m2" "S2_m3"
# [11] "S2_m4" "S2_m6" "S2_m7" "S3_m1" "S3_m2" "S3_m3" "S3_m4" "S3_m6" "S3_m7" "S4_m1"
# [21] "S4_m2" "S4_m3" "S4_m4" "S4_m6" "S4_m7"
T3a <- T3[, order(names(T3))]
T3a[, c(1:3, 23:25)]
#   S1_m1 S1_m2 S1_m3 S4_m6 S4_m7 site
# 1     5     1  <NA>  <NA>  <NA>   a1
# 2  <NA>  <NA>  <NA>  <NA>  <NA>   a2
# 3    2a   c11  <NA>  <NA>  <NA>   a3
# 4  <NA>  <NA>  <NA>  <NA>  <NA>   a4
# 5  <NA>  <NA>  <NA>  <NA>  <NA>   a5
# 6  <NA>  <NA>  <NA>  <NA>  <NA>   a6
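Whichever option you use, note that the wide columns are character, because outcome mixes numbers with values such as "1a" and "c11". For the t-tests mentioned in the question, they need converting first. The following is only a minimal sketch, assuming that non-numeric codes should simply be treated as missing (the question does not say this). The data frame `toy` and the helper `to_num` are made up for illustration.

```r
# Hypothetical wide data for one method (m1) at two steps; values are made up.
toy <- data.frame(
  site  = c("a1", "a2", "a3", "a4", "a5"),
  S1_m1 = c("5", "1a", "2", "7", "4"),
  S2_m1 = c("6", "3", "4", "9", "c11"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: coerce to numeric; non-numeric codes become NA.
to_num <- function(x) suppressWarnings(as.numeric(x))

s1 <- to_num(toy$S1_m1)
s2 <- to_num(toy$S2_m1)

# Paired t-test across sites; pairs with an NA on either side are dropped.
res <- t.test(s1, s2, paired = TRUE)
res$parameter  # degrees of freedom = number of complete pairs - 1
```

Whether dropping codes like "1a" is appropriate depends on what they mean in your data, so treat the coercion step as a placeholder for your own cleaning rule.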

Both options above use the following as the input for "rawdata" (with the step values written without spaces, e.g. "S1"):

rawdata <- structure(list(site = c("a1", "a1", "a1", "a2", "a2", "a2", "a2", 
"a3", "a3", "a4", "a4", "a5", "a6", "a6"), step = c("S1", "S1", 
"S2", "S1", "S1", "S3", "S4", "S1", "S1", "S1", "S2", "S3", "S2", 
"S3"), method = c("m1", "m2", "m6", "m6", "m4", "m7", "m2", "m1", 
"m2", "m4", "m2", "m3", "m1", "m4"), outcome = c("5", "1", "4", 
"1a", "3", "2", "7", "2a", "c11", "2", "5", "6", "7", "8")), .Names = c("site", 
"step", "method", "outcome"), row.names = c(NA, -14L), class = "data.frame")
A5C1D2H2I1M1N2O1R2T1 answered Oct 23 '22 12:10


I had a similar problem. You could just merge your two time variables into one. Using the same rawdata from the other answer, you could use unite() from the tidyr package:

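library(tidyr)  # provides unite()
# Combine step and method into a single "timevar" column (e.g. "S1_m1")
rawdata <- unite(rawdata, timevar, step, method)
reshape(rawdata, direction = "wide", idvar = "site", timevar = "timevar")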

This helped my understanding of the long -> wide process a lot.
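The same idea can be sketched in base R without tidyr, assuming paste() is an acceptable stand-in for unite(). The column name `timevar` and the "_" separator are illustrative choices. Note that step and method are dropped before reshaping; otherwise reshape() would spread them into wide columns as well.

```r
# Same sample data as in the other answer, written as a plain data.frame.
rawdata <- data.frame(
  site    = c("a1", "a1", "a1", "a2", "a2", "a2", "a2",
              "a3", "a3", "a4", "a4", "a5", "a6", "a6"),
  step    = c("S1", "S1", "S2", "S1", "S1", "S3", "S4",
              "S1", "S1", "S1", "S2", "S3", "S2", "S3"),
  method  = c("m1", "m2", "m6", "m6", "m4", "m7", "m2",
              "m1", "m2", "m4", "m2", "m3", "m1", "m4"),
  outcome = c("5", "1", "4", "1a", "3", "2", "7",
              "2a", "c11", "2", "5", "6", "7", "8"),
  stringsAsFactors = FALSE
)

# Merge the two "time" variables into one key, e.g. "S1_m1".
rawdata$timevar <- paste(rawdata$step, rawdata$method, sep = "_")

# Keep only the id, the combined key, and the value, then go wide.
wide <- reshape(rawdata[, c("site", "timevar", "outcome")],
                direction = "wide", idvar = "site", timevar = "timevar")
names(wide)
```

Unlike the dcast(..., drop = FALSE) option in the other answer, this only creates columns for step/method combinations that actually occur in the data.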

Max M answered Oct 23 '22 10:10