What am I doing wrong with this long-to-wide reshape?

the problem

A function I wrote to widen a long table of repeated multivariate time series data for input to classifier functions seems to cause erroneous results even for easy test data, but I can't locate the issue.

background

I am keeping a bunch of repeated trials of multivariate time series in a long data.table format like this, for speed and ease of use with most R idioms:

> this.data
              Time Trial Class Channel      Value
     1: -100.00000     1    -1      V1  0.4551513
     2:  -96.07843     2    -1      V1  0.8241555
     3:  -92.15686     3    -1      V1  0.7667328
     4:  -88.23529     4    -1      V1  0.7475106
     5:  -84.31373     5    -1      V1  0.9810273
    ---                                          
204796:  884.31373   196     1      V4 50.2642220
204797:  888.23529   197     1      V4 50.5747661
204798:  892.15686   198     1      V4 50.5749421
204799:  896.07843   199     1      V4 50.1988299
204800:  900.00000   200     1      V4 50.7756015

Specifically, the above data has a Time column with 256 unique numbers from -100 to 900, which is repeated for each Channel, for each Trial. Similarly, each Channel is one of V1, V2, V3, V4, repeated for each Time sample, for each Trial. In other words, any combination of Time, Trial, Channel should uniquely specify a Value. To keep things simple, all Trials under 100 have Class -1, and all above 99 have Class 1. For testing purposes, all Values in Class 1 have a mean of 50, while those in Class -1 have a mean of 0. (This data can be generated and tweaked using the dummy.plug() function included in a gist I made.)
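(A quick way to sanity-check that uniqueness claim is to count rows per key combination. This is a sketch on a toy stand-in for the real table, since the gist data is not reproduced here:)

```r
library(data.table)

# Toy stand-in for this.data: a full Time x Trial x Channel grid,
# so every (Time, Trial, Channel) combination appears exactly once.
toy <- CJ(Time = seq(-100, 900, length.out = 4),
          Trial = 1:3,
          Channel = c("V1", "V2"))
toy[, Value := rnorm(.N)]

# Any key combination appearing more than once would break a reshape;
# this should be an empty table for well-formed data.
dupes <- toy[, .N, by = .(Time, Trial, Channel)][N > 1]
print(nrow(dupes))  # 0 for unique keys
```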

In order to process the data with various machine learning classification algorithms, it seems necessary to reshape it into a somewhat wider form, so that each channel's time series gets its own column while the remaining variables serve as identifiers. (For example, the stepwise classifier stepclass from klaR needs the features in different columns so it can choose which ones to drop from or add to its model as it trains.) Since there are repeated trials, I have not had success making existing functions like the cast family work, so I wrote my own:

##### converting from long table form to channel-split wide form #####
# for multivariate repeated time series
channel.form <- function(input.table,
                         value.col = "Voltage",
                         split.col = "Channel",
                         class.col = "Class",
                         time.col = "Time",
                         trial.col = "Trial") {
# Converts long table format to slightly wider format split by channels.
# For epoched datasets.

  setkeyv(input.table, class.col)

  chan.split <- split(input.table,input.table[,get(split.col)])

  chan.d <- cbind(lapply(chan.split, function(x){
    x[,value.col,with=FALSE]}))

  chan.d <- as.data.table(matrix(unlist(chan.d),
                            ncol = input.table[,length(unique(get(split.col)))], 
                            byrow=TRUE))

  # reintroduce class labels
  # since the split is over identical sections for each channel, we can just use
  # the first split's labels
  chan.d <- chan.d[,c(class.col):= chan.split[[1]][,get(class.col)]]
  chan.d[,c(class.col):=as.factor(get(class.col))]

  # similarly with time and trial labels
  chan.d <- chan.d[,Time:= chan.split[[1]][,get(time.col)]]
  chan.d <- chan.d[,Trial:= chan.split[[1]][,get(trial.col)]]

  return(chan.d) 
}

Using this function, I take some multivariate trials that I have prepared into a long data.table like the one at the top, and reshape them into a wider one that looks like this:

> this.data.training.channel
              V1        V2        V3        V4 Class       Time Trial
    1: -50.58389 -50.56397 -50.74251 -50.86700    -1 -100.00000     1
    2: -50.92713 -50.28009 -50.15078 -50.70161    -1  -96.07843     2
    3: -50.84276 -50.02456 -50.20015 -50.45228    -1  -76.47059     7
    4: -50.68679 -50.05475 -50.04270 -50.83900    -1  -72.54902     8
    5: -50.55954 -50.88998 -50.01273 -50.86856    -1  -68.62745     9
   ---                                                               
35836:  49.52361  49.37465  49.73997  49.10543     1  876.47059   194
35837:  49.93162  49.38352  49.62406  49.16854     1  888.23529   197
35838:  49.67510  49.63853  49.54259  49.81198     1  892.15686   198
35839:  49.26295  49.98449  49.60437  49.03918     1  896.07843   199
35840:  49.05030  49.42035  49.48546  49.73438     1  900.00000   200

At this point, I take the widened table and give it to a classifier like lda(), then test it on a separate random portion of the same data:

lda.model <- lda(Class ~ . -Trial, this.data.training.channel)
lda.pred <- predict(lda.model, this.data.testing.channel)

symptoms

However, even if I generate obscenely separated dummy data (see picture below), I am getting near-chance results with existing reasonable libraries. (I know the libraries are probably not at fault, because if I allow the algorithm to use the trial index as a training feature, it correctly classifies every input.)

[plot: hugely separated two-class data in four channels]

> table(predicted = lda.pred$class, data = this.data.testing.channel[,Class])
         data
predicted   -1    1
       -1 2119 1878
       1  5817 5546

> 1-sum(lda.pred$class != this.data.testing.channel[,Class])/length(lda.pred$class)
[1] 0.4984375

> table(predicted = sda.pred$class, data = this.data.testing.channel[,Class])
         data
predicted   -1    1
       -1 3705 3969
       1  3719 3967

> 1-sum(sda.pred$class != this.data.testing.channel[,Class])/length(sda.pred$class)
[1] 0.4994792

The error rate is basically a coin flip, despite the class means being separated by about 50 units (class 1 centered near 50, class -1 near 0). I have to be making some huge mistake (which I think is a programming one, otherwise I would be over on Cross Validated), but I have spent days prodding it and rewriting code with no improvement. (For example, I get the same result whether or not I scale the input values to mean 0 and variance 1.)
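(One diagnostic that would expose this symptom directly is comparing per-class channel means in the widened table: if the reshape scrambled rows relative to their labels, the class means collapse together. A sketch on made-up data using the same column names as my widened table above:)

```r
library(data.table)

# Fake widened table: Class -1 centered at 0, Class 1 centered at 50,
# mimicking the structure of this.data.training.channel.
set.seed(1)
wide <- data.table(V1 = c(rnorm(100, 0), rnorm(100, 50)),
                   V2 = c(rnorm(100, 0), rnorm(100, 50)),
                   Class = factor(rep(c(-1, 1), each = 100)))

# If rows are still aligned with their labels, the class means should
# be far apart; near-identical means would point to scrambled rows.
m <- wide[, lapply(.SD, mean), by = Class, .SDcols = c("V1", "V2")]
print(m)
```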

reproducing the problem

A complete gist that can be run to reproduce the problem is available here.

possible problems I considered, what I tried

(see previous revisions of the question for the full list, due to length considerations)

I wrote a function (included in the gist) to generate easily separable dummy data, and wrote another function to average each of the two classes, faceted by Channel and colored by Class, like the plot above. Playing with each of the parameters (difference in population means, channel count, etc.) seems to produce expected output, as well as peeking at appropriate subsets using calls like this.data[Trial==1,unique(Time),by=Subject].

what do I need to solve this?

I would greatly appreciate any advice on fixing this. I just can't see what I'm doing wrong.

If someone either diagnoses/locates the issue, or illustrates, using a different approach, a reshaped table from this data that works with these (popular) classifier functions, I won't just accept the answer; I will also award a bounty (after testing, of course).

session info

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] doMC_1.3.2              iterators_1.0.6         AUC_0.3.0              
 [4] LiblineaR_1.80-7        RcppRoll_0.1.0          RcppArmadillo_0.4.300.0
 [7] Rcpp_0.11.1             foreach_1.4.1           cvTools_0.3.2          
[10] robustbase_0.90-2       latticist_0.9-44        vcd_1.3-1              
[13] latticeExtra_0.6-26     lattice_0.20-29         pheatmap_0.7.7         
[16] RColorBrewer_1.0-5      klaR_0.6-10             MASS_7.3-29            
[19] ggplot2_0.9.3.1         reshape2_1.2.2          data.table_1.9.2       
[22] sda_1.3.3               fdrtool_1.2.12          corpcor_1.6.6          
[25] entropy_1.2.0           zoo_1.7-11              testthat_0.8           

loaded via a namespace (and not attached):
 [1] codetools_0.2-8  colorspace_1.2-4 combinat_0.0-8   compiler_3.0.2   DEoptimR_1.0-1  
 [6] dichromat_2.0-0  digest_0.6.4     gtable_0.1.2     gWidgets_0.0-52  labeling_0.2    
[11] munsell_0.4.2    plyr_1.8         proto_0.3-10     scales_0.2.3     stringr_0.6.2   
[16] tools_3.0.2   
asked Oct 21 '22 by bright-star

1 Answer

I could not reproduce your error and I found some problems with dummy.plug(). I generated data with

library(data.table)
library(reshape2)
library(MASS)

set.seed(115)
pp<-dummy.plug(trial.count = 200,
    chan.count = 4,
    mean.diff = 100,
    value.name = "Value")

I don't much care for data.table, so I just converted it to a basic data.frame.

dd<-as.data.frame(pp)

Now you say that Time, Trial, and Channel should uniquely identify a value, but that does not seem to be the case in the dummy data. I see that

subset(dd, Time==-100 & Trial==1 & Channel=="V1")

#       Time Trial Class Channel      Value
# 1     -100     1    -1      V1 0.73642916
# 6401  -100     1    -1      V1 0.17648939
# 12801 -100     1    -1      V1 0.41366964
# 19201 -100     1    -1      V1 0.07044473
# 25601 -100     1    -1      V1 0.86583284
# 32001 -100     1    -1      V1 0.24255411
# 38401 -100     1    -1      V1 0.92473225
# 44801 -100     1    -1      V1 0.69989600

So there are clearly multiple values for each combination. To proceed, I decided just to take the mean of the observed values. I had no problems using dcast with

xx<-dcast(dd, Class+Time+Trial~Channel, fun.aggregate=mean)
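(A minimal standalone illustration of what fun.aggregate = mean does to duplicated keys, on toy data rather than the gist's:)

```r
library(reshape2)

# Two rows share the key (Time=1, Trial=1, Channel="V1");
# fun.aggregate = mean collapses them into a single cell.
long <- data.frame(Time = c(1, 1, 1),
                   Trial = c(1, 1, 1),
                   Channel = c("V1", "V1", "V2"),
                   Value = c(2, 4, 7))
w <- dcast(long, Time + Trial ~ Channel, value.var = "Value",
           fun.aggregate = mean)
print(w)
#   Time Trial V1 V2
# 1    1     1  3  7
```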

Then I split up the training/test datasets

train.trials = sample(unique(dd$Trial), 140)
train.data = subset(xx, Trial %in% train.trials)
test.data = subset(xx, !Trial %in% train.trials)

Then I ran lda as above

lda.model <- lda(Class ~ . -Trial, train.data)
lda.pred <- predict(lda.model, test.data)

And I checked out how I did

table(lda.pred$class, test.data$Class)
#        -1    1
#   -1  704    0
#   1     0 1216

And I appear to do much better than you did.

Unless something bad happened when I converted the data.table to a data.frame, there seem to be problems with your test data. Perhaps there is a problem with your non-cast reshape function. Seeing as dcast works just fine, you may want to check that your function works as well.

answered Oct 28 '22 by MrFlick