 

How to fix the false positives rate of a linear SVM?

I am an SVM newbie and this is my use case: I have a lot of unbalanced data that I need to binary-classify using a linear SVM. I need to fix the false positive rate at certain values and measure the corresponding false negative rate for each value. I am using something like the following code, which relies on scikit-learn's SVM implementation:

from sklearn import svm

# define training data
X = [[0, 0], [1, 1]]
y = [0, 1]

# define and train the SVM
clf = svm.LinearSVC(C=0.01, class_weight='auto')  # 'auto' reweights classes for unbalanced distributions
clf.fit(X, y)

# compute false positives and false negatives
predictions = clf.predict(X)  # predict takes a 2D array of samples, not single examples
false_positives = [(a, b) for (a, b) in zip(predictions, y) if a != b and b == 0]
false_negatives = [(a, b) for (a, b) in zip(predictions, y) if a != b and b == 1]

Is there a way to play with a parameter (or a few parameters) of the classifier such that one of these metrics is effectively fixed?

Asked by Rabih Kodeih on Jan 29 '14

People also ask

How do you calculate the false positive rate?

The false positive rate is calculated as FP/(FP+TN), where FP is the number of false positives and TN the number of true negatives (FP+TN being the total number of negatives). It's the probability that a false alarm will be raised: that a positive result will be given when the true value is negative.
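As a quick illustration, here is a minimal sketch of computing this quantity with scikit-learn's confusion_matrix (the toy labels and predictions are made up for this example):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]  # toy ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]  # toy predictions

# for binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / float(fp + tn)
print(fpr)  # 0.25 here: 1 false positive out of 4 true negatives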

How can you reduce false positives in classification?

To reduce the number of false negatives (FN) or false positives (FP), you can also retrain the model on the same data with the loss reweighted so that the type of error you care about is penalized more heavily, repeating the process until the model converges to a satisfactory operating point.

How can SVM reduce false negatives?

To minimize false negatives, you can set higher weights for training samples labeled as the positive class; by default the weights are set to 1 for all classes. To change this, use the class_weight hyper-parameter.
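For instance, a minimal sketch (the weight of 10 is an arbitrary illustration, not a recommended value):

from sklearn.svm import LinearSVC

# penalize mistakes on the positive class (1) ten times more than on class 0,
# which pushes the classifier toward predicting the positive class more often
clf = LinearSVC(class_weight={0: 1, 1: 10})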

How do you reduce a false negative rate?

To reduce false negatives, lower the threshold: the model then predicts fewer inputs as negative, which reduces the number of false negative cases. Similarly, raising the threshold decreases the number of false positives.
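A minimal sketch of this tradeoff on a toy array of decision scores (both the scores and the thresholds are made up for illustration):

import numpy as np

scores = np.array([-1.2, -0.4, 0.1, 0.3, 0.8, 1.5])  # toy decision scores
y_true = np.array([0, 0, 1, 0, 1, 1])                # toy labels

for threshold in (-0.5, 0.0, 0.5):
    y_pred = (scores > threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    print(threshold, "FP:", fp, "FN:", fn)
# raising the threshold trades false positives for false negatives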


2 Answers

The class_weight parameter allows you to push this false positive rate up or down. Let me use an everyday example to illustrate how this works. Suppose you own a night club, and you operate under two constraints:

  1. You want as many people as possible to enter the club (paying customers)
  2. You do not want any underage people in, as this will get you in trouble with the state

On an average day, say, only 5% of the people attempting to enter the club will be underage. You are faced with a choice: being lenient or being strict. The former will boost your profits by as much as 5%, but you run the risk of an expensive lawsuit. The latter inevitably means some people who are just above the legal age will be denied entry, which costs you money too. You want to adjust the relative cost of leniency vs strictness. Note: you cannot directly control how many underage people enter the club, but you can control how strict your bouncers are.

Here is a bit of Python that shows what happens as you change the relative importance.

from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# keep only the first feature to make the problem harder
# drop the third class for simplicity
X = data.data[:100, 0:1]
y = data.target[:100]
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

for i in range(1, 20):
    clf = LinearSVC(class_weight={0: 1, 1: i})
    clf = clf.fit(X[:50, :], y[:50])
    print(i, Counter(clf.predict(X[50:])))
    # print(clf.decision_function(X[50:]))

Which outputs

1 Counter({1: 22, 0: 28})
2 Counter({1: 31, 0: 19})
3 Counter({1: 39, 0: 11})
4 Counter({1: 43, 0: 7})
5 Counter({1: 43, 0: 7})
6 Counter({1: 44, 0: 6})
7 Counter({1: 44, 0: 6})
8 Counter({1: 44, 0: 6})
9 Counter({1: 47, 0: 3})
10 Counter({1: 47, 0: 3})
11 Counter({1: 47, 0: 3})
12 Counter({1: 47, 0: 3})
13 Counter({1: 47, 0: 3})
14 Counter({1: 47, 0: 3})
15 Counter({1: 47, 0: 3})
16 Counter({1: 47, 0: 3})
17 Counter({1: 48, 0: 2})
18 Counter({1: 48, 0: 2})
19 Counter({1: 48, 0: 2})

Note how the number of data points classified as 0 decreases as the relative weight of class 1 increases. Assuming you have the computational resources and time to train and evaluate a classifier for each candidate weight, you can plot the precision and recall of each one and get a figure like the one below (shamelessly stolen off the internet). You can then use that to decide what the right value of class_weight is for your use case.

[Figure: precision-recall tradeoff]
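A minimal sketch of how you might generate the points for such a plot yourself, assuming the same X, y and train/test split as in the snippet above (precision_score and recall_score come from sklearn.metrics):

from sklearn.metrics import precision_score, recall_score

precisions, recalls = [], []
for i in range(1, 20):
    clf = LinearSVC(class_weight={0: 1, 1: i}).fit(X[:50, :], y[:50])
    y_pred = clf.predict(X[50:])
    precisions.append(precision_score(y[50:], y_pred))
    recalls.append(recall_score(y[50:], y_pred))
# plot recalls against precisions to visualize the tradeoff,
# then pick the weight whose operating point fits your constraints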

Answered by mbatchkarov on Oct 22 '22


The predict method for LinearSVC in sklearn looks like this:

def predict(self, X):
    """Predict class labels for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        Samples.

    Returns
    -------
    C : array, shape = [n_samples]
        Predicted class label per sample.
    """
    scores = self.decision_function(X)
    if len(scores.shape) == 1:
        indices = (scores > 0).astype(np.int)
    else:
        indices = scores.argmax(axis=1)
    return self.classes_[indices]

So in addition to what mbatchkarov suggested, you can change the decisions made by the classifier (any classifier, really) by changing the boundary at which it says something belongs to one class or the other.

from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# keep only the first feature to make the problem harder
# drop the third class for simplicity
X = data.data[:100, 0:1]
y = data.target[:100]
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

# train a classifier on the first half of the data
clf = LinearSVC().fit(X[:50, :], y[:50])

decision_boundary = 0
print(Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8)))
# Counter({1: 27, 0: 23})

decision_boundary = 0.5
print(Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8)))
# Counter({0: 39, 1: 11})

You can tune the decision boundary to whatever value your needs dictate.
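Coming back to the original question of fixing the false positive rate at a target value, a minimal sketch: choose the threshold as a quantile of the decision scores of the true negatives in a held-out set (this assumes the clf, X and y from the snippet above; the 0.05 target is just an illustration):

import numpy as np

# decision scores of held-out samples whose true label is negative (class 0)
negative_scores = clf.decision_function(X[50:][y[50:] == 0])

target_fpr = 0.05  # illustrative target false positive rate
# pick the threshold so that roughly 5% of true negatives score above it
decision_boundary = np.quantile(negative_scores, 1 - target_fpr)

predictions = (clf.decision_function(X[50:]) > decision_boundary).astype(np.int8)
# then measure the false negative rate at this fixed false positive rate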

Answered by Matti Lyra on Oct 22 '22