
How to present multiple time-series data to an SVM (ksvm) in R (or, How to present two-dimensional input data to an SVM)

How can I make a ksvm model aware that the first 100 numbers in a dataset are all time series data from one sensor, while the next 100 numbers are all time series data from another sensor, etc, for six separate time series sensor inputs? Alternatively (and perhaps more generally), how can I present two-dimensional input data to an SVM?

The process for which I need a binary yes/no prediction model has six non-periodic time series inputs, all with the same sampling frequency. An event triggers the start of data collection, and after a pre-determined time I need a yes/no prediction (preferably including a probability-of-correctness output). The characteristics of the time-series inputs which should produce 'yes' vs. 'no' are not known, but what is known is that there should be some correlation between each of the input time series and the final outcome. There is also significant noise present on all inputs. Both the meaningful information and the noise appear on the inputs as short-duration bursts (the meaningful bursts always occur at roughly the same time for a given input source), but identifying which bursts are meaningful and which are noise is difficult; i.e. the fact that a burst happened at the "right" time for one input does not necessarily indicate a "yes" output; it may just be noise. To know whether the prediction should be "yes", the model needs to somehow incorporate information from all six time series inputs. I have a collection of prior data with approximately 900 'no' results and 100 'yes' results.

I'm pretty new to both R and SVMs, but I think I want to use an SVM model (kernlab's ksvm). I'm having trouble figuring out how to present the input data to it. I'm also not sure how to tell ksvm that the data is time series data, or if that is even relevant. I've tried using the Rattle GUI front-end to R to pull in my data from csv files, but I can't figure out how to present the time series data from all six inputs to the ksvm model. As a csv-file input, it seems the only way to import the data for all 1000 samples is to organize the input data such that all sample data (for all six time series inputs) is on a single line of the csv file, with a separate known-result file's data presented on each line of the csv file. But in doing so, the fact that the 1st, 2nd, 3rd, etc. numbers are each part of the time series data from the first sensor is lost in translation, as is the fact that the 101st, 102nd, 103rd, etc. numbers are each part of the time series data from the second sensor, and so on; to the ksvm model, each data sample is just an isolated number unrelated to its neighbor. How can I present this data to ksvm as six separate but interrelated time series arrays? Or how can I present a 2-dimensional array of data to ksvm?
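
The one-line-per-sample layout described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `flatten_run` is a hypothetical helper, not part of kernlab or Rattle, and the column-naming scheme (`sensor1_t1`, `sensor1_t2`, ...) is invented here so that each value keeps a record of its sensor and time index. The data is the shortened 4-sensor, 7-sample "yes" example shown later in the question.

```r
# Hypothetical helper: flatten one run (rows = time steps, columns = sensors)
# into a single named row, so each value keeps its sensor and time index.
flatten_run <- function(run) {
  m <- as.matrix(run)
  # as.vector() walks the matrix column by column, so all of sensor1's
  # samples come first, then all of sensor2's, and so on.
  values <- as.vector(m)
  names(values) <- paste0(rep(colnames(m), each = nrow(m)),
                          "_t", rep(seq_len(nrow(m)), times = ncol(m)))
  values
}

# Shortened example run from the question (4 sensors, 7 time steps).
run_yes <- data.frame(
  sensor1 = c(0, 51, 0, 300, 0, 0, 335),
  sensor2 = c(0, 60, 0, 0, 0, 0, 0),
  sensor3 = c(0, 0, 0, 0, 1518, 0, 0),
  sensor4 = c(0, 172, 0, 37, 0, 0, 4273)
)

# One row of 4 * 7 = 28 named features for this run.
row_yes <- flatten_run(run_yes)
```

Stacking one such row per run (for example with `rbind`) and adding a `result` column gives a single table with one sample per line. The names are bookkeeping for the reader; whether the model can exploit the ordering they encode is exactly what the question asks.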


UPDATE:

OK, there are two basic strategies I've tried with dismal results (well, the resulting models were better than blind guessing, but not much).

First of all, not being familiar with R, I used the Rattle GUI front-end to R. I have a feeling that by doing so I may be limiting my options. But anyway, here's what I've done:

Example known result files (shown with only 4 sensors instead of 6, and only 7 time samples instead of 100):

training168_yes.csv

Seconds Since 1/1/2000,sensor1,sensor2,sensor3,sensor4
454768042.4,           0,      0,      0,      0
454768042.6,           51,     60,     0,      172
454768043.3,           0,      0,      0,      0
454768043.7,           300,    0,      0,      37
454768044.0,           0,      0,      1518,   0
454768044.3,           0,      0,      0,      0
454768044.7,           335,    0,      0,      4273

training169_no.csv

Seconds Since 1/1/2000,sensor1,sensor2,sensor3,sensor4
454767904.5,           0,      0,      0,      0
454767904.8,           51,     0,      498,    0
454767905.0,           633,    0,      204,    55
454767905.3,           0,      0,      0,      512
454767905.6,           202,    655,    739,    656
454767905.8,           0,      0,      0,      0
454767906.0,           0,      934,    0,      7814

The only way I know to get the data for all training samples into R/Rattle is to massage & combine all result files into a single .csv file, with one sample result per line. I can think of only two ways to do that, so I tried them both (and I knew when I was doing it that by doing this I'm hiding potentially important information, which is the point of this SO question):

TRIAL #1: For each result file, add each sensor's samples into a single number, blasting away all temporal information:

result,sensor1,sensor2,sensor3,sensor4
no,    886,    1589,   1441,   9037
yes,   686,    60,     1518,   4482
no,    632,    1289,   1173,   9152
yes,   411,    67,     988,    5030
no,    772,    1703,   1351,   9008
yes,   490,    70,     1348,   4909
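
As a rough sketch of how the Trial #1 summary is derived, each run can be collapsed with `colSums`, which adds up every sensor's samples over time. The data frames below reproduce the two shortened example files from the question; the variable names (`run_no`, `run_yes`, `summary1`) are made up for this illustration.

```r
# Shortened example runs from the question (4 sensors, 7 time steps each).
run_no <- data.frame(
  sensor1 = c(0, 51, 633, 0, 202, 0, 0),
  sensor2 = c(0, 0, 0, 0, 655, 0, 934),
  sensor3 = c(0, 498, 204, 0, 739, 0, 0),
  sensor4 = c(0, 0, 55, 512, 656, 0, 7814)
)
run_yes <- data.frame(
  sensor1 = c(0, 51, 0, 300, 0, 0, 335),
  sensor2 = c(0, 60, 0, 0, 0, 0, 0),
  sensor3 = c(0, 0, 0, 0, 1518, 0, 0),
  sensor4 = c(0, 172, 0, 37, 0, 0, 4273)
)

# Sum each sensor over time: one row per run, one column per sensor.
# All timing information is discarded at this step.
summary1 <- data.frame(
  result = c("no", "yes"),
  rbind(colSums(run_no), colSums(run_yes))
)
```

The two resulting rows match the first two rows of the Trial #1 table.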

When I get done using Rattle to generate the SVM, Rattle's log tab gives me the following script which can be used to generate & train an SVM in RGui:

library(rattle)
building <- TRUE
scoring  <- ! building
library(colorspace)
crv$seed <- 42 
crs$dataset <- read.csv("file:///C:/Users/mminich/Desktop/stackoverflow/trainingSummary1.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
set.seed(crv$seed) 
crs$nobs <- nrow(crs$dataset) # 6 observations 
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.67*crs$nobs) # 4 observations
crs$validate <- NULL
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 2 observations
# The following variable selections have been noted.
crs$input <- c("sensor1", "sensor2", "sensor3", "sensor4")
crs$numeric <- c("sensor1", "sensor2", "sensor3", "sensor4")
crs$categoric <- NULL
crs$target  <- "result"
crs$risk    <- NULL
crs$ident   <- NULL
crs$ignore  <- NULL
crs$weights <- NULL
require(kernlab, quietly=TRUE)
set.seed(crv$seed)
crs$ksvm <- ksvm(as.factor(result) ~ .,
      data=crs$dataset[,c(crs$input, crs$target)],
      kernel="polydot",
      kpar=list("degree"=1),
      prob.model=TRUE)

TRIAL #2: For each result file, add the samples for all sensors for each time into a single number, blasting away any information about individual sensors:

result,time1, time2, time3, time4, time5, time6, time7
no,    0,     549,   892,   512,   2252,  0,     8748
yes,   0,     283,   0,     337,   1518,  0,     4608
no,    0,     555,   753,   518,   2501,  0,     8984
yes,   0,     278,   12,    349,   1438,  3,     4441
no,    0,     602,   901,   499,   2391,  0,     7989
yes,   0,     271,   3,     364,   1474,  1,     4599
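
The Trial #2 summary can be sketched the same way with `rowSums`, which adds up all sensors at each time step. As before, the data frames reproduce the two shortened example files from the question, and the variable names are made up for this illustration.

```r
# Shortened example runs from the question (4 sensors, 7 time steps each).
run_no <- data.frame(
  sensor1 = c(0, 51, 633, 0, 202, 0, 0),
  sensor2 = c(0, 0, 0, 0, 655, 0, 934),
  sensor3 = c(0, 498, 204, 0, 739, 0, 0),
  sensor4 = c(0, 0, 55, 512, 656, 0, 7814)
)
run_yes <- data.frame(
  sensor1 = c(0, 51, 0, 300, 0, 0, 335),
  sensor2 = c(0, 60, 0, 0, 0, 0, 0),
  sensor3 = c(0, 0, 0, 0, 1518, 0, 0),
  sensor4 = c(0, 172, 0, 37, 0, 0, 4273)
)

# Sum all sensors at each time step: one row per run, one column per step.
# The distinction between sensors is discarded at this step.
sums <- rbind(rowSums(run_no), rowSums(run_yes))
colnames(sums) <- paste0("time", seq_len(ncol(sums)))
summary2 <- data.frame(result = c("no", "yes"), sums)
```

The two resulting rows match the first two rows of the Trial #2 table.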

And again I use Rattle to generate the SVM, and Rattle's log tab gives me the following script:

library(rattle)
building <- TRUE
scoring  <- ! building
library(colorspace)
crv$seed <- 42 
crs$dataset <- read.csv("file:///C:/Users/mminich/Desktop/stackoverflow/trainingSummary2.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
set.seed(crv$seed) 
crs$nobs <- nrow(crs$dataset) # 6 observations 
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.67*crs$nobs) # 4 observations
crs$validate <- NULL
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 2 observations
# The following variable selections have been noted.
crs$input <- c("time1", "time2", "time3", "time4", "time5", "time6", "time7")
crs$numeric <- c("time1", "time2", "time3", "time4", "time5", "time6", "time7")
crs$categoric <- NULL
crs$target  <- "result"
crs$risk    <- NULL
crs$ident   <- NULL
crs$ignore  <- NULL
crs$weights <- NULL
require(kernlab, quietly=TRUE)
set.seed(crv$seed)
crs$ksvm <- ksvm(as.factor(result) ~ .,
      data=crs$dataset[,c(crs$input, crs$target)],
      kernel="polydot",
      kpar=list("degree"=1),
      prob.model=TRUE)

Unfortunately even with nearly 1000 training datasets, both of the resulting models give me only slightly better results than I would get by just random chance. I'm pretty sure it would do better if there's a way to avoid blasting away either the temporal data or the distinction between different sensors. How can I do that? BTW, I don't know if it's important, but the sensor readings for all sensors are taken at almost exactly the same time, but the time difference between one reading and the next varies by maybe 10 to 20% generally from one run to the next (i.e. between "training" files), and I have no control over that. I think that's probably safe to ignore (i.e. I think it's probably safe to just number the readings sequentially like 1,2,3,etc.).

asked Apr 17 '15 16:04 by phonetagger



1 Answer

An SVM takes feature vectors and uses them to build a classifier. Your feature vectors can have 7 dimensions: one value from each of the 6 sensors, plus time as the seventh. Each point in time at which you have a signal produces another vector. Create t vectors, Vt, of size 7 each, and make those your feature vectors. Populate them with your data and pass them into ksvm. Adding t as another feature in the feature vector ties together all the data that was recorded at a specific time, and it also helps the SVM learn that there is a progression of values. You can then choose a subset of the Vt as a training set. You will have to manually tag these vectors with a label that is the correct classification.
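
A minimal sketch of this per-time-step layout, using the shortened 4-sensor example data from the question (so each vector here has 5 features rather than 7). `run_to_vectors` is a hypothetical helper, and the assumption that every time step inherits its run's yes/no label is one reading of "manually tag these vectors". The kernel settings are copied from the Rattle script in the question and are not tuned. The fit runs only if kernlab is installed.

```r
# Hypothetical helper: turn one run into one row per time step, with the
# time index t as an extra feature and the run's label repeated on each row.
run_to_vectors <- function(run, label) {
  data.frame(t = seq_len(nrow(run)), run, result = label)
}

# Shortened example runs from the question (4 sensors, 7 time steps each).
run_no <- data.frame(
  sensor1 = c(0, 51, 633, 0, 202, 0, 0),
  sensor2 = c(0, 0, 0, 0, 655, 0, 934),
  sensor3 = c(0, 498, 204, 0, 739, 0, 0),
  sensor4 = c(0, 0, 55, 512, 656, 0, 7814)
)
run_yes <- data.frame(
  sensor1 = c(0, 51, 0, 300, 0, 0, 335),
  sensor2 = c(0, 60, 0, 0, 0, 0, 0),
  sensor3 = c(0, 0, 0, 0, 1518, 0, 0),
  sensor4 = c(0, 172, 0, 37, 0, 0, 4273)
)

# Stack the per-time-step vectors of all runs into one training table.
vectors <- rbind(run_to_vectors(run_no, "no"),
                 run_to_vectors(run_yes, "yes"))
vectors$result <- as.factor(vectors$result)

# Fit only if kernlab is available; kernel and degree are taken from the
# question's Rattle script, not chosen for this data.
if (requireNamespace("kernlab", quietly = TRUE)) {
  library(kernlab)
  model <- ksvm(result ~ ., data = vectors,
                kernel = "polydot", kpar = list(degree = 1))
  per_step <- predict(model, vectors)  # one predicted label per time step
}
```

Note that this produces one prediction per time step, not one per run. The answer does not say how to combine them into a single yes/no for a run, so that step is left open here.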

answered Nov 02 '22 19:11 by Benjy Kessler
