
LIBSVM Data Preparation: Excel data to LIBSVM format

Tags: libsvm

I want to learn how to use LIBSVM for regression, and I'm currently stuck on preparing my data. My data is in .csv and .xlsx format, and I want to convert it into the LIBSVM data format.

Current Data

So far, I understand that the data should be in this format so that it can be used in LIBSVM:

LIBSVM format

Based on what I read, for regression, "label" is the target value which can be any real number.

I am doing an electric load prediction study. Can anyone tell me what it is? And finally, how should I organize my columns and rows?

Gabriel Luna asked Nov 05 '16 09:11


1 Answer

The LIBSVM data format is given by:

<label> <index1>:<value1> <index2>:<value2> ...
...
...

As you can see, this forms a matrix with LineCount rows and (IndexCount + 1) columns; more precisely, a sparse matrix. If you specify a value for every index, you have a dense matrix. If you specify only a few indices, as in <label> 5:<value> 8:<value>, only indices 5 and 8 (and of course the label) carry an explicit value, and all other values are treated as 0. This keeps the notation simple and saves space, since datasets can be huge.
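To make the sparse notation concrete, here is a minimal Python sketch that turns one dense sample into a LIBSVM line. The function name `to_libsvm_line` is my own invention and not part of LIBSVM; it only assumes that list position 0 maps to index 1 and that zeros may be dropped.

```python
def to_libsvm_line(label, features):
    """Format one sample as a LIBSVM line.

    `features` is a dense list; list position 0 becomes index 1.
    Zero-valued features are omitted, which is the sparse notation.
    """
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value}")
    return " ".join(parts)


print(to_libsvm_line(0.72, [25, 0]))  # second feature is 0, so it is dropped
print(to_libsvm_line(0.68, [29, 2]))  # both features are written
```

Writing the zero explicitly (for example `2:0`) is also valid; dropping it just keeps the file smaller.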

For the meaning of the tags, I quote the README file:

<label> is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number. <index> is an integer starting from 1, <value> is a real number. The indices must be in an ascending order.

As you can see, the label is the quantity you want to predict. Each index identifies one feature of your data, and the value after the colon is that feature's value for the given row. A feature is simply a measurable indicator that correlates with your target value, so that a better prediction can be made.
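The rules from the README (indices start at 1 and must be ascending) can be checked with a few lines of Python. This is only a sketch of how I would read such a line back into a dense vector; `parse_libsvm_line` and its `n_features` parameter are hypothetical names, not LIBSVM functions.

```python
def parse_libsvm_line(line, n_features):
    """Parse one LIBSVM line into (label, dense feature list).

    Features that are not listed on the line stay at 0.0.
    """
    tokens = line.split()
    label = float(tokens[0])
    dense = [0.0] * n_features
    previous = 0  # last index seen; valid indices start at 1
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index < 1 or index > n_features:
            raise ValueError(f"index {index} is outside 1..{n_features}")
        if index <= previous:
            raise ValueError(f"index {index} follows {previous}; must be ascending")
        dense[index - 1] = float(value_text)
        previous = index
    return label, dense


print(parse_libsvm_line("0.72 1:25", 2))  # the missing index 2 is read as 0.0
```

A line that repeats an index, such as `0.68 2:29 2:2`, raises a `ValueError` here, which is a cheap way to catch typos before training.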

Totally fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He has found that the outside temperature from the day before is a good indicator for that, so he selects temperature as the feature with index 1. Important: indices always start at one; zero can sometimes cause strange LIBSVM behaviour. Then, to his surprise, he notices that the day of the week (Monday to Sunday, encoded as 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:

<myLoad_Value> 1:<outsideTemperatureFromYesterday_Value> 2:<dayOfTheWeek_Value>

Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):

0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...
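Since the question starts from a .csv file, here is a rough sketch of the conversion using only Python's standard `csv` module. The column names `load_kwh`, `temperature_c` and `weekday` are made up for this example; replace them with the headers in your own file. An .xlsx file would first have to be saved as CSV from Excel, because the standard library cannot read .xlsx directly.

```python
import csv
import io


def csv_to_libsvm(csv_file, label_column, feature_columns):
    """Convert CSV rows to LIBSVM lines.

    The order of `feature_columns` decides the indices (first column -> 1).
    """
    lines = []
    for row in csv.DictReader(csv_file):
        parts = [row[label_column].strip()]
        for index, column in enumerate(feature_columns, start=1):
            text = row[column].strip()
            if float(text) != 0:  # sparse format: zeros can be omitted
                parts.append(f"{index}:{text}")
        lines.append(" ".join(parts))
    return lines


# In-memory stand-in for a real file opened with open("data.csv", newline="")
sample = io.StringIO(
    "load_kwh,temperature_c,weekday\n"
    "0.72,25,0\n"
    "0.65,21,1\n"
    "0.68,29,2\n"
)
libsvm_lines = csv_to_libsvm(sample, "load_kwh", ["temperature_c", "weekday"])
print("\n".join(libsvm_lines))
```

So the layout to aim for is one row per observation, the target in one column, and each feature in its own column; the script only renumbers the feature columns as indices.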

Notice that we could leave out 2:0 because of the sparse matrix format. This would be your training data for a LIBSVM model. Then we predict tomorrow's load as follows. You know today's temperature, let us say 23 °C, and today is Tuesday, which is 1, so tomorrow is 2. So this is the line (or vector) to use with the model:

0 1:23 2:2

Here, you can set the <label> value arbitrarily. It does not influence the prediction; svm-predict only uses it to compute the error it reports, and writes the predicted value to its output file. I hope this helps.
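If you want to build that query line programmatically, something like the following should work. It is just a sketch: `prediction_line` is a hypothetical helper, and it relies on Python's `date.weekday()`, which happens to use the same Monday = 0 to Sunday = 6 encoding as above.

```python
from datetime import date, timedelta


def prediction_line(today, temperature_today):
    """Build the LIBSVM line for predicting tomorrow's load.

    Feature 1 is today's temperature (yesterday's, seen from tomorrow),
    feature 2 is tomorrow's weekday. The label 0 is only a placeholder.
    """
    tomorrow = today + timedelta(days=1)
    weekday = tomorrow.weekday()  # Monday = 0 ... Sunday = 6
    return f"0 1:{temperature_today} 2:{weekday}"


# 8 November 2016 was a Tuesday, so tomorrow is encoded as 2
line = prediction_line(date(2016, 11, 8), 23)
print(line)
```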

thatguy answered Oct 18 '22 18:10