I am new to R and SVMs and I am trying to profile the svm function from the e1071 package. However, I can't find any large dataset that allows me to get a good profiling range of results when varying the size of the input data. Does anyone know how to make svm work hard? Which dataset should I use? Any particular parameters to svm that make it work harder?

Here are some commands I am using to test the performance. Perhaps it is more useful and easier to see what I am trying to do here:
# load libraries
library(class)
library(e1071)
# I've been using golubEsets (more examples available)
library(golubEsets)
# get the data: 7129x38 matrix
data(Golub_Train)
n <- exprs(Golub_Train)
# duplicate rows (to make the dataset larger)
n <- rbind(n, n)
# take the training labels as a factor (pData() is the Biobase accessor)
samplelabels <- factor(pData(Golub_Train)$ALL.AML)
# fit the svm and profile it
Rprof('svm.out')
svmmodel1 <- svm(x = t(n), y = samplelabels, type = 'C-classification', kernel = "radial", cross = 10)
Rprof(NULL)
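# afterwards, summarise the collected profile (summaryRprof is in base R's utils)
summaryRprof('svm.out')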
I keep increasing the dataset by duplicating rows and columns, but I hit the memory limit before making svm work harder...
When it comes to SVMs, there are many packages available in R that implement them; however, e1071 is the most intuitive package for this purpose. Its svm() function is an interface to libsvm, which makes implementing SVMs quick and simple. The R implementation depends on the S3 class mechanism: it provides a training function with standard and formula interfaces, a predict() method, and a plot() method for visualising the data, support vectors, and decision boundaries. Install and load the package with the following commands:
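install.packages("e1071")
library(e1071)

As a minimal sketch of those interfaces (iris is just a stock example dataset, not part of the question):

# formula interface: fit a classifier on two features
model <- svm(Species ~ Petal.Length + Petal.Width, data = iris, kernel = "radial")
# predict() method: confusion table on the training data
table(predict(model, iris), iris$Species)
# plot() method: shows data, support vectors and decision boundaries
plot(model, iris)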
Besides its interface to libsvm, e1071 also offers functions for latent class analysis, short-time Fourier transform, fuzzy clustering, shortest-path computation, bagged clustering, the naive Bayes classifier, and more.
An SVM model is a representation of the input data as points in space, with a clear gap between the groups of points representing different categories. This division is made by a hyperplane, which is a line in 2-D space, a plane in 3-D space, and the higher-dimensional analogue beyond that.
In terms of "working SVM out" - what will make SVM work "harder" is a more complex model which is not easily separated, higher dimensionality and a larger, denser dataset.
SVM performance degrades with:
Varying Parameters
There are parameters you can change to make SVM take longer. Of course, these parameters affect the quality of the solution you will get and may not make any sense to use.
Using C-SVM, varying C results in different runtimes (the analogous parameter in nu-SVM is nu). If the dataset is reasonably separable, making C smaller will result in a longer runtime because the SVM will allow more training points to become support vectors. If the dataset is not very separable, making C bigger will cause longer runtimes because you are essentially telling the SVM you want a narrow-margin solution that fits tightly to the data, and that will take much longer to compute when the data doesn't separate easily.
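A minimal sketch of timing this effect on a synthetic two-class dataset (the data and the cost values are illustrative, not from the question; cost is e1071's name for C):

library(e1071)
set.seed(1)
x <- matrix(rnorm(2000 * 20), ncol = 20)                  # 2000 points, 20 features
y <- factor(ifelse(x[, 1] + rnorm(2000) > 0, "A", "B"))   # noisily separable labels
for (C in c(0.01, 1, 100)) {
  t <- system.time(svm(x, y, type = "C-classification", kernel = "radial", cost = C))
  cat("cost =", C, "-> elapsed:", t["elapsed"], "s\n")
}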
Often you find when doing a parameter search that there are parameters that will increase computation time with no appreciable increase in accuracy.
The other parameters are kernel parameters. If you vary them to increase the complexity of calculating the kernel, then naturally the SVM runtime will increase. The linear kernel is simple and will be the fastest; non-linear kernels will of course take longer. Some parameters may not increase the calculation complexity of the kernel, but will force a much more complex model, which may take the SVM much longer to find the optimal solution to.
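Reusing the same kind of synthetic data, a sketch comparing kernel runtimes (again purely illustrative):

library(e1071)
set.seed(1)
x <- matrix(rnorm(2000 * 20), ncol = 20)
y <- factor(ifelse(x[, 1] + rnorm(2000) > 0, "A", "B"))
for (k in c("linear", "polynomial", "radial", "sigmoid")) {
  t <- system.time(svm(x, y, type = "C-classification", kernel = k))
  cat(format(k, width = 10), "elapsed:", t["elapsed"], "s\n")
}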
Datasets to Use:
The UCI Machine Learning Repository is a great source of datasets.
The MNIST handwriting recognition dataset is a good one to use - you can randomly select subsets of the data to create increasingly larger datasets. Keep in mind that the full dataset contains all ten digits; SVM is of course binary, so you would have to reduce the data to just two digits or do some kind of multi-class SVM.
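A sketch of that subset-scaling idea; X and lab are hypothetical placeholders for a feature matrix and label vector you have already loaded (e.g. two MNIST digits):

library(e1071)
# X and lab are assumed to be loaded beforehand (hypothetical names)
for (s in c(1000, 2000, 4000, 8000)) {
  idx <- sample(nrow(X), s)                               # random subset of s rows
  t <- system.time(svm(X[idx, ], factor(lab[idx]), type = "C-classification"))
  cat(s, "rows -> elapsed:", t["elapsed"], "s\n")
}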
You can easily generate datasets as well. To generate a linear dataset, randomly select a normal vector to a hyperplane, then generate datapoints and label each one by which side of the hyperplane it falls on. Add some randomness by allowing points within a certain distance of the hyperplane to sometimes be labeled differently, and increase the complexity by increasing that overlap between classes. Alternatively, generate some number of clusters of normally distributed points, labeled either 1 or -1, so that the distributions overlap at the edges.

The classic non-linear example is a checkerboard: generate points and label them in a checkerboard pattern. To make it more difficult, increase the number of squares, the number of dimensions, and the number of datapoints. You will have to use a non-linear kernel for that, of course. Both generators are sketched below.
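A sketch of both generators under the assumptions above (the sizes, noise levels, and the 4x4 board are arbitrary choices):

library(e1071)
set.seed(42)
n <- 5000; d <- 10
# linear dataset: label by the side of a random hyperplane through the origin
w <- rnorm(d)                                   # random normal vector
X <- matrix(rnorm(n * d), ncol = d)
margin <- as.vector(X %*% w)
y <- ifelse(margin > 0, 1, -1)
flip <- abs(margin) < 0.5 & runif(n) < 0.3      # sometimes mislabel points near the hyperplane
y[flip] <- -y[flip]

# checkerboard dataset: label 2-D points by the parity of their grid square
Xc <- matrix(runif(n * 2, 0, 4), ncol = 2)      # a 4x4 board
yc <- ifelse((floor(Xc[, 1]) + floor(Xc[, 2])) %% 2 == 0, 1, -1)
model <- svm(Xc, factor(yc), type = "C-classification", kernel = "radial")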