 

SVM Classification - minimum number of input sets for each class

I'm trying to build an app that detects which images on a web page are advertisements. Once I detect them, I won't allow them to be displayed on the client side.

Based on the help I got on this Stack Overflow question, I decided that an SVM is the best approach for my goal.

So I have coded an SVM and SMO myself. The dataset, which I got from the UCI data repository, has 3280 instances (link to dataset), of which around 400 belong to the advertisement class and the rest to the non-advertisement class.

Right now I'm training the SVM on the first 2800 input sets. But after looking at the accuracy rate, I realised that most of those 2800 input sets are from the non-advertisement class, so I'm getting very good accuracy for that class.

So what can I do here? Roughly how many input sets should I give the SVM for training, and how many of them from each class?

Thanks. Cheers. (I made a new question because the context is different from my previous question, Optimization of Neural Network input data.)


Thanks for the reply. I want to check whether I'm deriving the C values for the ad and non-ad classes correctly. Please give me feedback on this.

[image: derivation of the C values]

Or you can see the doc version here.

You can see the graph for y1 equal to y2 here: [image]

and for y1 not equal to y2 here: [image]

Amol Joshi asked Dec 29 '22 01:12



2 Answers

There are two ways of going about this. One would be to balance the training data so it includes an equal number of advertisement and non-advertisement images. This could be done by either oversampling the 400 advertisement images or undersampling the thousands of non-advertisement images. Since training time can increase dramatically with the number of data points used, you should probably first try undersampling the non-advertisement images and create a training set with the 400 ad images and 400 randomly selected non-advertisements.
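The undersampling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `undersample`, the label values (+1 for ads, -1 for non-ads), and the toy data with exactly 400 ads and 2880 non-ads are made up for the example and do not come from the question's code.

```python
import random

def undersample(samples, labels, minority_label, seed=0):
    """Return a balanced, shuffled (samples, labels) pair.

    Keeps every minority-class example plus an equally sized random
    subset of the other examples. Assumes the minority class really is
    the smaller one; otherwise random.sample raises ValueError.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Draw as many majority examples as there are minority examples.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy stand-in for the real data (assumed counts: 400 ads, 2880 non-ads).
labels = [1] * 400 + [-1] * 2880
samples = [[float(i)] for i in range(len(labels))]
balanced_x, balanced_y = undersample(samples, labels, minority_label=1)
```

Shuffling before training also avoids the problem in the question, where taking the first 2800 rows in file order decides the class mix.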

The other solution would be to use a weighted SVM, so that margin errors on the ad images are penalized more heavily than those on non-ads. In the LIBSVM package this is done with the -wi flag, which sets the C parameter of class i to weight*C. From your description of the data (roughly 2880 non-ads to 400 ads), you could try weighting the ad images about 7 times more heavily than the non-ads.
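As a rough sketch of where the factor of about 7 comes from, the weights can be derived from the class counts and turned into `-wi` arguments for LIBSVM's `svm-train`. The helper names, the label values (+1 for ads, -1 for non-ads), and the exact counts are assumptions made for this example.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by (largest class count) / (its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}

def libsvm_weight_args(weights):
    """Format the weights as svm-train arguments of the form -w<label> <weight>."""
    args = []
    for label, w in sorted(weights.items(), reverse=True):
        args += [f"-w{label}", f"{w:.1f}"]
    return args

# Assumed counts: 400 ads (+1) and 2880 non-ads (-1).
labels = [1] * 400 + [-1] * 2880
weights = class_weights(labels)      # ads get 2880 / 400 = 7.2
args = libsvm_weight_args(weights)   # ['-w1', '7.2', '-w-1', '1.0']
```

These arguments would be passed to `svm-train` along with the usual options and the training file. The ratio is only a starting point; the weight is worth tuning on held-out data.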

dmcer answered Dec 31 '22 15:12



The required size of your training set depends on the sparseness of the feature space. As far as I can see, you have not said which image features you have chosen to use. Before you can train, you need to convert each image into a vector of numbers (features) that describes the image, hopefully capturing the aspects that you care about.
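To make "a vector of numbers" concrete, here is a minimal sketch of a feature extractor. The chosen features (width, height, aspect ratio, and URL keyword flags), the keyword list, and the example URL are purely illustrative assumptions, not the features of any particular dataset.

```python
def image_features(width, height, url, keywords=("ad", "banner", "click")):
    """Map one image's metadata to a fixed-length numeric vector.

    Hypothetical features: size, aspect ratio, and one 0/1 flag per
    keyword that occurs as a substring of the image URL.
    """
    aspect = width / height if height else 0.0
    url = url.lower()
    flags = [1.0 if k in url else 0.0 for k in keywords]
    return [float(width), float(height), aspect] + flags

# A banner-shaped image with an ad-like URL (made-up example values).
vec = image_features(468, 60, "http://example.com/banners/ad_01.gif")
```

Every image must map to a vector of the same length, here 3 plus the number of keywords. Substring matching is crude (a short keyword like "ad" also matches unrelated words), so real features would need more care.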

Oh, and unless you are reimplementing SVM for sport, I'd recommend just using libsvm.

Vebjorn Ljosa answered Dec 31 '22 15:12
