A function I wrote to widen a long table of repeated multivariate time series data for input to classifier functions seems to cause erroneous results even for easy test data, but I can't locate the issue.
I am keeping a bunch of repeated trials of multivariate time series in a long data.table format like this, for speed and ease of use with most R idioms:
> this.data
Time Trial Class Channel Value
1: -100.00000 1 -1 V1 0.4551513
2: -96.07843 2 -1 V1 0.8241555
3: -92.15686 3 -1 V1 0.7667328
4: -88.23529 4 -1 V1 0.7475106
5: -84.31373 5 -1 V1 0.9810273
---
204796: 884.31373 196 1 V4 50.2642220
204797: 888.23529 197 1 V4 50.5747661
204798: 892.15686 198 1 V4 50.5749421
204799: 896.07843 199 1 V4 50.1988299
204800: 900.00000 200 1 V4 50.7756015
Specifically, the above data has a Time column with 256 unique numbers from -100 to 900, which is repeated for each Channel, for each Trial. Similarly, each Channel is one of V1, V2, V3, V4, repeated for each Time sample, for each Trial. In other words, any combination of Time, Trial, and Channel uniquely specifies a Value. In order to keep things simple, all Trials under 100 have Class -1, and all above 99 have Class 1. (For testing purposes, all Values in Class 1 have a mean of 50, while those in Class -1 have a mean of 0. This data can be generated and tweaked using the dummy.plug() function included in a gist I made.)
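(Roughly speaking, the generator builds something like the sketch below; the real dummy.plug() in the gist has more options, so treat the function name, arguments, and distributions here as illustrative rather than the actual gist code.)
# Illustrative sketch only -- the real dummy.plug() lives in the gist
library(data.table)

make.dummy <- function(trial.count = 200, chan.count = 4,
                       sample.count = 256, mean.diff = 50) {
  # one row per (Time, Trial, Channel) combination
  long <- CJ(Time    = seq(-100, 900, length.out = sample.count),
             Trial   = seq_len(trial.count),
             Channel = paste0("V", seq_len(chan.count)))
  long[, Class := ifelse(Trial < 100, -1L, 1L)]
  # Class 1 values centred on mean.diff, Class -1 values centred on 0
  long[, Value := rnorm(.N, mean = ifelse(Class == 1L, mean.diff, 0))]
  setcolorder(long, c("Time", "Trial", "Class", "Channel", "Value"))
  long[]
}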
In order to process the data using different machine learning classification algorithms, it seems necessary to reshape the data into a slightly wider form, so that each channel's time series has its own column while the others remain as ids. (For example, the stepwise classifier stepclass from klaR needs the features in separate columns so that it can choose which ones to drop or add to its model as it trains.) Since there are repeated trials, I have not had success making existing functions like the cast family work, so I wrote my own:
##### converting from long table form to channel-split wide form #####
# for multivariate repeated time series
channel.form <- function(input.table,
                         value.col = "Voltage",
                         split.col = "Channel",
                         class.col = "Class",
                         time.col  = "Time",
                         trial.col = "Trial") {
  # Converts long table format to slightly wider format split by channels.
  # For epoched datasets.
  setkeyv(input.table, class.col)
  chan.split <- split(input.table, input.table[, get(split.col)])
  chan.d <- cbind(lapply(chan.split, function(x) {
    x[, value.col, with = FALSE]}))
  chan.d <- as.data.table(matrix(unlist(chan.d),
                                 ncol = input.table[, length(unique(get(split.col)))],
                                 byrow = TRUE))
  # reintroduce class labels
  # since the split is over identical sections for each channel, we can just use
  # the first split's labels
  chan.d <- chan.d[, c(class.col) := chan.split[[1]][, get(class.col)]]
  chan.d[, c(class.col) := as.factor(get(class.col))]
  # similarly with time and trial labels
  chan.d <- chan.d[, Time := chan.split[[1]][, get(time.col)]]
  chan.d <- chan.d[, Trial := chan.split[[1]][, get(trial.col)]]
  return(chan.d)
}
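(For concreteness, a call looks roughly like the sketch below; the exact invocation is in the gist. Note that value.col has to be overridden because this data stores its measurements in a Value column rather than the default Voltage.)
# Rough sketch of a call -- the exact invocation is in the gist
this.data.channel <- channel.form(this.data, value.col = "Value")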
Using this function, I take some multivariate trials that I have prepared into a long data.table
like the one at the top, and reshape them into a wider one that looks like this:
> this.data.training.channel
V1 V2 V3 V4 Class Time Trial
1: -50.58389 -50.56397 -50.74251 -50.86700 -1 -100.00000 1
2: -50.92713 -50.28009 -50.15078 -50.70161 -1 -96.07843 2
3: -50.84276 -50.02456 -50.20015 -50.45228 -1 -76.47059 7
4: -50.68679 -50.05475 -50.04270 -50.83900 -1 -72.54902 8
5: -50.55954 -50.88998 -50.01273 -50.86856 -1 -68.62745 9
---
35836: 49.52361 49.37465 49.73997 49.10543 1 876.47059 194
35837: 49.93162 49.38352 49.62406 49.16854 1 888.23529 197
35838: 49.67510 49.63853 49.54259 49.81198 1 892.15686 198
35839: 49.26295 49.98449 49.60437 49.03918 1 896.07843 199
35840: 49.05030 49.42035 49.48546 49.73438 1 900.00000 200
At this point, I take the widened table and give it to a classifier like lda(), then test it on a separate random portion of the same data:
lda.model <- lda(Class ~ . -Trial, this.data.training.channel)
lda.pred <- predict(lda.model, this.data.testing.channel)
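(The training/testing split itself isn't shown here; roughly, it holds out whole trials at random before widening, along the lines of the sketch below. The number of held-out trials is illustrative.)
# Rough sketch of the train/test split -- the real split is in the gist
test.trials <- sample(unique(this.data[, Trial]), 60)
this.data.testing.channel  <- channel.form(this.data[Trial %in% test.trials],
                                           value.col = "Value")
this.data.training.channel <- channel.form(this.data[!Trial %in% test.trials],
                                           value.col = "Value")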
However, even if I generate obscenely separated dummy data (see picture below), I am getting near-chance results with existing reasonable libraries. (I know the libraries are probably not at fault, because if I allow the algorithm to use the trial index as a training feature, it correctly classifies every input.)
> table(predicted = lda.pred$class, data = this.data.testing.channel[,Class])
data
predicted -1 1
-1 2119 1878
1 5817 5546
> 1-sum(lda.pred$class != this.data.testing.channel[,Class])/length(lda.pred$class)
[1] 0.4984375
> table(predicted = sda.pred$class, data = this.data.testing.channel[,Class])
data
predicted -1 1
-1 3705 3969
1 3719 3967
> 1-sum(sda.pred$class != this.data.testing.channel[,Class])/length(sda.pred$class)
[1] 0.4994792
The error rate is basically a coin flip, despite the values from class 1 being about 50 times the values from class -1. I have to be making some huge mistake (which I think is a programming one; otherwise I would be over on Cross Validated), but I have spent days prodding at it and rewriting code with no improvement. (As an example, note that I get the same result whether or not I scale the input values so that they have mean 0 and variance 1.)
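(The scaling amounts to standardising Value on the long table before reshaping; the per-channel grouping in the sketch below is just one plausible way to do it.)
# Sketch: standardise Value within each Channel before widening
this.data[, Value := (Value - mean(Value)) / sd(Value), by = Channel]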
A complete gist that can be run to reproduce the problem is available here.
(see previous revisions of the question for the full list, due to length considerations)
I wrote a function (included in the gist) to generate easily separable dummy data, and another function to plot the average of each of the two classes, faceted by Channel and colored by Class, like the plot above. Playing with each of the parameters (difference in population means, channel count, etc.) produces the expected output, as does peeking at appropriate subsets using calls like this.data[Trial==1,unique(Time),by=Subject].
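(The plotting function is also in the gist; a rough ggplot2 equivalent of what it does looks like this sketch.)
# Rough sketch of the class-mean sanity-check plot
library(ggplot2)
class.means <- this.data[, list(Mean = mean(Value)), by = list(Time, Channel, Class)]
ggplot(class.means, aes(x = Time, y = Mean, colour = factor(Class))) +
  geom_line() +
  facet_wrap(~ Channel)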
I would greatly appreciate any advice on fixing this. I just can't see what I'm doing wrong.
If someone either diagnoses or locates the issue, or illustrates, using a different approach, a reshaped table from this data that works with these (popular) classifier functions, I won't just accept the answer, I'll award a bounty (after testing, of course).
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel grid stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] doMC_1.3.2 iterators_1.0.6 AUC_0.3.0
[4] LiblineaR_1.80-7 RcppRoll_0.1.0 RcppArmadillo_0.4.300.0
[7] Rcpp_0.11.1 foreach_1.4.1 cvTools_0.3.2
[10] robustbase_0.90-2 latticist_0.9-44 vcd_1.3-1
[13] latticeExtra_0.6-26 lattice_0.20-29 pheatmap_0.7.7
[16] RColorBrewer_1.0-5 klaR_0.6-10 MASS_7.3-29
[19] ggplot2_0.9.3.1 reshape2_1.2.2 data.table_1.9.2
[22] sda_1.3.3 fdrtool_1.2.12 corpcor_1.6.6
[25] entropy_1.2.0 zoo_1.7-11 testthat_0.8
loaded via a namespace (and not attached):
[1] codetools_0.2-8 colorspace_1.2-4 combinat_0.0-8 compiler_3.0.2 DEoptimR_1.0-1
[6] dichromat_2.0-0 digest_0.6.4 gtable_0.1.2 gWidgets_0.0-52 labeling_0.2
[11] munsell_0.4.2 plyr_1.8 proto_0.3-10 scales_0.2.3 stringr_0.6.2
[16] tools_3.0.2
I could not reproduce your error, and I found some problems with dummy.plug(). I generated data with
library(data.table)
library(reshape2)
library("MASS")
set.seed(115)

pp <- dummy.plug(trial.count = 200,
                 chan.count = 4,
                 mean.diff = 100,
                 value.name = "Value")
And I don't care for data.table, so I just converted it to a basic data.frame.
dd<-as.data.frame(pp)
Now, you say that Time, Trial, and Channel should uniquely identify a value, but that does not seem to be the case in the dummy data. I see that
subset(dd, Time==-100 & Trial==1 & Channel=="V1")
# Time Trial Class Channel Value
# 1 -100 1 -1 V1 0.73642916
# 6401 -100 1 -1 V1 0.17648939
# 12801 -100 1 -1 V1 0.41366964
# 19201 -100 1 -1 V1 0.07044473
# 25601 -100 1 -1 V1 0.86583284
# 32001 -100 1 -1 V1 0.24255411
# 38401 -100 1 -1 V1 0.92473225
# 44801 -100 1 -1 V1 0.69989600
So there are clearly multiple values for each combination. To proceed, I decided to just take the mean of the observed values. I had no problems using dcast with
xx<-dcast(dd, Class+Time+Trial~Channel, fun.aggregate=mean)
Then I split up the training/test datasets
train.trials = sample(unique(dd$Trial), 140)
train.data = subset(xx, Trial %in% train.trials)
test.data = subset(xx, !Trial %in% train.trials)
Then I ran lda as above
lda.model <- lda(Class ~ . -Trial, train.data)
lda.pred <- predict(lda.model, test.data)
And I checked out how I did
table(lda.pred$class, test.data$Class)
# -1 1
# -1 704 0
# 1 0 1216
And I appear to do much better than you did.
Unless something bad happened when I converted the data.table to a data.frame, there seem to be problems with your test data. Perhaps there is a problem with your non-cast reshape function. Seeing as dcast works just fine, maybe you want to check that your function works as well.
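One quick way to do that check (I have not run this, so treat it as a sketch): widen the same dummy data with both approaches and compare. Your channel.form() keeps every duplicated row, while dcast() with fun.aggregate=mean averages them, so even the row counts should be telling.
# Hypothetical consistency check between the two reshaping approaches
wide.custom <- channel.form(pp, value.col = "Value")   # your function, on the data.table
nrow(wide.custom)   # one row per duplicated observation
nrow(xx)            # one row per unique (Class, Time, Trial) after dcast

# spot-check one trial/time point against the long data
wide.custom[Trial == 1 & Time == -100]
subset(dd, Trial == 1 & Time == -100)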