
Datasets to test Nonlinear SVM

I'm implementing a nonlinear SVM and I want to test my implementation on simple data that is not linearly separable. Google didn't help me find what I want. Can you please advise me where I can find such data, or at least how I can generate it manually?

Thanks,

Morano88 asked May 07 '11 11:05


2 Answers

Well, SVMs are two-class classifiers; that is, they place data on either side of a single decision boundary.

Therefore, I would suggest a data set comprised of just two classes. That's not strictly necessary, because an SVM can of course separate more than two classes by passing the classifier over the data multiple times (in series), but that is cumbersome to do during initial testing.

So, for instance, you can use the iris data set, linked to in Scott's answer. It's comprised of three classes: Class I is linearly separable from Classes II and III, but Classes II and III are not linearly separable from each other. If you want to use this data set, for convenience's sake you might prefer to remove Class I (approximately the first 50 data rows), so that what remains is a two-class system in which the two remaining classes are not linearly separable.

The iris data set is quite small (150 x 4, or 50 rows per class x four features). Depending on where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
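The step of removing Class I could be sketched as follows. This is a minimal sketch that assumes the data is in the common `iris.data` CSV layout (four numeric columns followed by a class name such as `Iris-setosa`). The function name `load_two_class_iris` and the -1/+1 label encoding are illustrative choices, not part of any library.

```python
import csv
import io


def load_two_class_iris(text, drop="Iris-setosa"):
    """Parse iris.data-style CSV text and keep only the two non-dropped classes.

    Returns (X, y): X is a list of 4-float feature rows, y holds -1/+1 labels.
    The -1/+1 encoding is an assumption; use whatever your SVM expects.
    """
    X, y = [], []
    label_map = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 5:  # skip blank or malformed lines
            continue
        *features, name = row
        if name == drop:  # discard the linearly separable class
            continue
        if name not in label_map:
            if len(label_map) == 2:
                raise ValueError("more than two classes remain")
            # first class seen becomes -1, second becomes +1
            label_map[name] = -1 if not label_map else 1
        X.append([float(v) for v in features])
        y.append(label_map[name])
    return X, y


# Three rows in the iris.data layout, one per class, used here as a tiny demo.
sample = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""
X, y = load_two_class_iris(sample)
```

With the full file you would pass its contents instead of `sample`, for example `load_two_class_iris(open(path).read())`, where `path` is wherever you saved the data.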

An interesting family of data sets that are comprised of just two classes and that are definitely not linearly separable is the set of anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large, with quite a few features, yet still comprised of just two classes that are not linearly separable.

I am aware of two places from which you can retrieve this data. The first site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) comprised of 500 data points (rows) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.

The second source for this data contains two eHarmony data sets, one of which is comprised of over half a million rows and 59 features. In addition, these two data sets have undergone substantial processing, so the only task required before feeding them to your SVM is routine rescaling of the features.
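For reference, one common form of routine rescaling is standardization (zero mean, unit variance per feature). The sketch below is only one reasonable choice, since the answer does not say which rescaling is meant; the helper name `standardize` is hypothetical.

```python
import math


def standardize(X):
    """Rescale each column to zero mean and unit variance (population std).

    Constant columns have zero spread, so they are divided by 1.0 instead
    and end up as all zeros.
    """
    n = len(X)
    cols = list(zip(*X))  # transpose rows into columns
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


# Two features: one varying, one constant.
scaled = standardize([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
```

When you hold out a test set, compute the means and standard deviations on the training rows only and reuse them for the test rows.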

doug answered Nov 04 '22 12:11


The particular data set you need will depend heavily on your choice of kernel function, so it seems the easiest method is simply to create a toy data set yourself.

Some helpful ideas:

  • Concentric circles
  • Spiral-shaped classes
  • Nested banana-shaped classes
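The first two ideas can be generated with nothing but the standard library. The following is a rough sketch; the function names, default radii, noise levels, and the -1/+1 labels are all assumptions chosen for illustration, and the banana-shaped case is not covered.

```python
import math
import random


def concentric_circles(n_per_class=100, r_inner=1.0, r_outer=3.0,
                       noise=0.1, seed=0):
    """Two noisy rings around the origin: inner ring is -1, outer ring is +1."""
    rng = random.Random(seed)
    X, y = [], []
    for label, radius in ((-1, r_inner), (1, r_outer)):
        for _ in range(n_per_class):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.gauss(0.0, noise)  # jitter the radius
            X.append([r * math.cos(angle), r * math.sin(angle)])
            y.append(label)
    return X, y


def two_spirals(n_per_class=100, turns=1.5, noise=0.05, seed=0):
    """Two interleaved spiral arms; the second is the first rotated by 180 degrees."""
    rng = random.Random(seed)
    X, y = [], []
    for label, sign in ((-1, 1.0), (1, -1.0)):
        for i in range(n_per_class):
            t = (i + 1) / n_per_class  # position along the arm, up to 1
            angle = turns * 2.0 * math.pi * t
            X.append([sign * t * math.cos(angle) + rng.gauss(0.0, noise),
                      sign * t * math.sin(angle) + rng.gauss(0.0, noise)])
            y.append(label)
    return X, y


X_circ, y_circ = concentric_circles()
X_spir, y_spir = two_spirals()
```

No straight line separates either set, so a linear kernel should do poorly on both, while an RBF kernel should handle the circles easily. The spirals are a harder test and are sensitive to the kernel width.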

If you just want a random data set which is not linearly separable, may I suggest the Iris dataset? It is a multivariate data set where at least a couple of the classes in question are not linearly separable.

Hope this helps!

Scott answered Nov 04 '22 13:11