 

Profiling SVM (e1071) in R

I am new to R and SVMs, and I am trying to profile the svm function from the e1071 package. However, I can't find any large dataset that would let me get a good range of profiling results by varying the size of the input data. Does anyone know how to make svm work harder? Which dataset should I use? Are there any particular parameters to svm that make it work harder?

I have copied the commands I am using to test the performance below; perhaps that makes it easier to see what I am trying to do:

#load libraries
library(e1071)
#I've been using golubEsets (more example datasets are available)
library(golubEsets)

#get the data: a 7129x38 expression matrix
data(Golub_Train)
n <- exprs(Golub_Train)

#duplicate rows (to make the dataset larger)
n <- rbind(n, n)

#take the training labels as a factor (svm needs a factor response for classification)
samplelabels <- factor(pData(Golub_Train)$ALL.AML)

#fit the SVM and profile it
Rprof("svm.out")
svmmodel1 <- svm(x = t(n), y = samplelabels, type = "C-classification",
                 kernel = "radial", cross = 10)
Rprof(NULL)
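
For reference, the profile written by Rprof() above can be read back with summaryRprof():

#summarize the profiling output written by Rprof() above
prof <- summaryRprof("svm.out")
head(prof$by.self)   #time spent in each function itself
head(prof$by.total)  #time including callees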

I keep making the dataset larger by duplicating rows and columns, but I hit the memory limit before I manage to make svm work harder...

Manolete asked Jun 13 '12



1 Answer

In terms of making SVM "work harder": what makes SVM work harder is a more complex model that is not easily separable, higher dimensionality, and a larger, denser dataset.

SVM performance degrades as (see the timing sketch after this list):

  • dataset size increases (number of data points)
  • sparsity decreases (fewer zeros)
  • dimensionality increases (number of attributes)
  • non-linear kernels are used (and kernel parameters can make the kernel evaluation more complex)
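
As a minimal sketch of the size effect, the following times svm() on randomly labeled (hence hard-to-separate) data of increasing size; the sizes and dimensions are arbitrary illustration choices:

library(e1071)

set.seed(42)
for (npoints in c(500, 1000, 2000, 4000)) {
  x <- matrix(rnorm(npoints * 50), ncol = 50)               #50 attributes
  y <- factor(sample(c("A", "B"), npoints, replace = TRUE)) #random labels: not separable
  tm <- system.time(svm(x, y, type = "C-classification", kernel = "radial"))
  cat(npoints, "points:", tm["elapsed"], "seconds\n")
}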

Varying Parameters

There are parameters you can change to make SVM take longer. Of course, these parameters affect the quality of the solution you get, and it may not make any sense to use them.

Using C-SVM, varying C will result in different runtimes (in e1071's svm() this is the cost argument; the analogous parameter in nu-SVM is nu). If the dataset is reasonably separable, making C smaller will result in a longer runtime because the SVM will allow more training points to become support vectors. If the dataset is not very separable, making C bigger will cause longer runtimes because you are essentially telling the SVM you want a narrow-margin solution that fits tightly to the data, and that will take much longer to compute when the data doesn't separate easily.
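
For example, a quick sketch that sweeps cost and reports the runtime and support-vector count (the cost values are arbitrary):

library(e1071)

set.seed(1)
x <- matrix(rnorm(1000 * 20), ncol = 20)
y <- factor(sample(c("A", "B"), 1000, replace = TRUE)) #deliberately hard to separate

for (C in c(0.01, 1, 100)) {
  tm <- system.time(m <- svm(x, y, type = "C-classification",
                             kernel = "radial", cost = C))
  cat("cost =", C, "| elapsed:", tm["elapsed"],
      "s | support vectors:", m$tot.nSV, "\n")
}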

Often you find when doing a parameter search that there are parameters that will increase computation time with no appreciable increase in accuracy.

The other parameters are kernel parameters, and if you vary them to increase the complexity of calculating the kernel then naturally the SVM runtime will increase. The linear kernel is simple and will be the fastest; non-linear kernels will of course take longer. Some parameters may not increase the calculation complexity of the kernel but will force a much more complex model, which may take the SVM much longer to find the optimal solution to.
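
As an illustration, a sketch comparing a linear kernel with radial kernels of increasing gamma (svm()'s gamma argument; the values are arbitrary):

library(e1071)

set.seed(2)
x <- matrix(rnorm(2000 * 30), ncol = 30)
y <- factor(sample(c("A", "B"), 2000, replace = TRUE))

#linear kernel: the simplest and typically the fastest
print(system.time(svm(x, y, type = "C-classification", kernel = "linear")))

#radial kernel with increasing gamma: more complex decision surfaces
for (g in c(0.01, 0.1, 1)) {
  tm <- system.time(svm(x, y, type = "C-classification",
                        kernel = "radial", gamma = g))
  cat("gamma =", g, "| elapsed:", tm["elapsed"], "s\n")
}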

Datasets to Use:

The UCI Machine Learning Repository is a great source of datasets.

The MNIST handwriting recognition dataset is a good one to use: you can randomly select subsets of the data to create increasingly larger datasets. Keep in mind that the data at the link contains all ten digits; SVM is inherently binary, so you would have to reduce the data to just two digits or do some kind of multi-class SVM (e1071's svm() does one-against-one multi-class automatically).
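
A minimal sketch of that subsetting, assuming MNIST has already been loaded into a feature matrix mnist_x and a label vector mnist_y (hypothetical names; load the data however you prefer):

library(e1071)

set.seed(3)
#keep only two digits so the problem is binary (mnist_x/mnist_y are assumed, not provided)
keep <- mnist_y %in% c(3, 8)
x2 <- mnist_x[keep, ]
y2 <- factor(mnist_y[keep])

#draw increasingly larger random subsets for profiling
for (n in c(1000, 5000, 10000)) {
  idx <- sample(nrow(x2), n)
  tm <- system.time(svm(x2[idx, ], y2[idx], type = "C-classification",
                        kernel = "radial"))
  cat(n, "points:", tm["elapsed"], "seconds\n")
}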

You can easily generate datasets as well (see the sketch below). To generate a linearly separable dataset, randomly pick a normal vector for a hyperplane, then generate data points and label each one according to which side of the hyperplane it falls on. Add some randomness by allowing points within a certain distance of the hyperplane to sometimes be labeled differently, and increase the complexity by increasing that overlap between the classes. Alternatively, generate a number of clusters of normally distributed points, labeled either 1 or -1, so that the distributions overlap at the edges. The classic non-linear example is a checkerboard: generate points and label them in a checkerboard pattern. To make it more difficult, enlarge the number of squares, increase the dimensions, and increase the number of data points. You will have to use a non-linear kernel for that, of course.
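
A sketch of the linear and checkerboard generators; the dimensions, overlap width, and grid size are arbitrary choices:

set.seed(4)
n <- 5000

#linear dataset: label by the side of a random hyperplane through the origin
d <- 10
w <- rnorm(d)                                #random normal vector to the hyperplane
x <- matrix(rnorm(n * d), ncol = d)
margin <- as.vector(x %*% w)
y <- factor(ifelse(margin > 0, 1, -1))
#flip some labels near the hyperplane to create class overlap
flip <- abs(margin) < 0.5 & runif(n) < 0.3
y[flip] <- ifelse(y[flip] == "1", "-1", "1")

#checkerboard dataset: label by the parity of the grid cell a point lands in
k <- 4                                       #a k x k board
xc <- matrix(runif(n * 2), ncol = 2)
cells <- floor(xc[, 1] * k) + floor(xc[, 2] * k)
yc <- factor(ifelse(cells %% 2 == 0, 1, -1))
#fit with a non-linear kernel, e.g. svm(xc, yc, kernel = "radial")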

karenu answered Sep 30 '22