 

SVM - relation between the number of training samples and the number of features

What should be the relation between the number of training samples and the dimensionality of the training data?

For example, I have a case with 20000 training samples and 16000 features. I am considering using PCA for dimensionality reduction, but I do not know how many dimensions I should reduce my training data to. Is there a relation between these? I am using a Support Vector Machine classifier with 2 classes and a linear kernel.

asked Nov 06 '13 by Simon

2 Answers

The solution found by the SVM is automatically restricted to the space spanned by the samples, so using PCA just to get rid of dimensions with zero variance won't change the solution (a small sketch of this invariance follows the numbered options below). And as damienfrancois wrote, if you reduce beyond that, you risk destroying relevant information. To avoid that, you have two options:

1) Trust that structural risk minimization is not just an interesting theoretical concept, but also does the right thing for your application, and just use the data as they are.

2) Use a feature selection algorithm to find out which of the features / feature combinations are actually informative. However, finding the optimum combination of features is clearly not feasible for so many features, so you can probably just sort the features by individual performance (in the linear case: a t-score) and then test how many of the best features you need in order to get a good result (see the second sketch below).
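As mentioned above, a full-rank PCA is just a rotation (plus centering) of the data, which a linear SVM is invariant to. A minimal sketch of this, assuming scikit-learn and synthetic data (shapes and names are illustrative, not from the question):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 50)                                # 200 samples, 50 features
y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)

clf_raw = SVC(kernel="linear").fit(X, y)

# Keeping every component (no dimension has zero variance here) is a pure
# rotation plus centering; the linear-kernel SVM's solution is unchanged.
X_rot = PCA(n_components=min(X.shape)).fit_transform(X)
clf_rot = SVC(kernel="linear").fit(X_rot, y)

# Agreement should be 1.0, up to solver tolerance.
print((clf_raw.predict(X) == clf_rot.predict(X_rot)).mean())
```

And for option 2, a sketch of the sort-then-test approach using scikit-learn's SelectKBest (f_classif is the per-feature F-test, i.e. the squared two-class t-score; the data and the values of k are made up):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(500, 2000)                       # mostly uninformative features
y = (X[:, :5].sum(axis=1) > 0).astype(int)     # only the first 5 matter

# Rank features individually, then test how many of the top-ranked
# features are needed for a good cross-validated result.
for k in [5, 50, 500, 2000]:
    pipe = make_pipeline(SelectKBest(f_classif, k=k),
                         LinearSVC(max_iter=10000))
    print(f"top {k:4d} features: CV accuracy = "
          f"{cross_val_score(pipe, X, y, cv=5).mean():.3f}")
```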

Zaw Lin's comment is of course correct: you can always separate classes in such a high-dimensional space. But equally of course, classifier performance should not be assessed on the training data but e.g. via cross-validation.
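To see why, a tiny sketch (pure noise features and completely random labels; with far more features than samples the training set is almost always linearly separable, yet cross-validated accuracy stays at chance level):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(100, 10000)        # far more features than samples, pure noise
y = rng.randint(0, 2, 100)       # completely random labels

clf = LinearSVC(max_iter=10000).fit(X, y)
print("training accuracy:", clf.score(X, y))                    # ~1.0
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())  # ~0.5
```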

answered by A. Donda

It is all a matter of the intrinsic dimensionality of the data and of the VC dimension. But those theoretical concepts won't be of much help in practice.

In practice, with a properly regularized linear SVM, you might simply work with the 16k features.
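For instance, a minimal sketch of that approach (LinearSVC and this C grid are one reasonable choice, not something prescribed here; stand-in data replaces the 20000 x 16000 matrix from the question):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Stand-in data; in the question this would have shape (20000, 16000).
rng = np.random.RandomState(0)
X = rng.randn(300, 1000)
y = (X[:, 0] > 0).astype(int)

# Tune the regularization strength C by cross-validation and train
# directly on all features, with no dimensionality reduction.
search = GridSearchCV(LinearSVC(max_iter=10000),
                      {"C": [0.001, 0.01, 0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```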

If you want to use PCA, look at the scree plot to get the percentage of variance retained when using the n principal components with the largest corresponding eigenvalues. But PCA-based feature extraction will 'destroy' information if the relationship between the label and the data is highly nonlinear.
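A sketch of reading the retained variance off a fitted PCA instead of an actual scree plot (synthetic data; the 95% threshold is just an example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 200)

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that keeps at least 95% of the variance.
n_components = int(np.searchsorted(cumvar, 0.95)) + 1
print(n_components)

# Equivalently, scikit-learn can pick this automatically:
# PCA(n_components=0.95).fit(X)
```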

Another option is to use other L1-regularized models, such as the LASSO.
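For example, a sketch of an L1-penalized linear model used for embedded feature selection (here an L1 LinearSVC rather than the regression LASSO, since this is a classification task; all names and values are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(300, 1000)
y = (X[:, :3].sum(axis=1) > 0).astype(int)

# The L1 penalty drives most coefficients to exactly zero,
# effectively selecting a sparse subset of the features.
clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000).fit(X, y)
print("non-zero coefficients:", int(np.sum(clf.coef_ != 0)))
```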

answered by damienfrancois