How to plot SVC classification for an unbalanced dataset with scikit-learn and matplotlib?

I have a text classification task with 2599 documents and five labels from 1 to 5. The documents are distributed over the labels as follows:

label | texts
------+------
5     |  1190
4     |   839
3     |   239
1     |   204
2     |   127

I already classified this textual data, but I got very low performance and also warnings about ill-defined metrics:

Accuracy: 0.461057692308

score: 0.461057692308

precision: 0.212574195636

recall: 0.461057692308

 confusion matrix:
[[  0   0   0   0 153]
 [  0   0   0   0  94]
 [  0   0   0   0 194]
 [  0   0   0   0 680]
 [  0   0   0   0 959]]

 classification report:
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       153
          2       0.00      0.00      0.00        94
          3       0.00      0.00      0.00       194
          4       0.00      0.00      0.00       680
          5       0.46      1.00      0.63       959

avg / total       0.21      0.46      0.29      2080

Clearly this is happening because I have an unbalanced dataset, so I found this paper where the authors propose several approaches to deal with the issue:

The problem is that with imbalanced datasets, the learned boundary is too close to the positive instances. We need to bias SVM in a way that will push the boundary away from the positive instances. Veropoulos et al. [14] suggest using different error costs for the positive (C+) and negative (C-) classes.

I know this could be very complicated, but SVC offers several hyperparameters. So my question is: is there any way to bias SVC so that it pushes the boundary away from the positive instances, using the hyperparameters that the SVC classifier offers? I know this may be a difficult problem, but any help is welcome. Thanks in advance, guys.
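As far as I understand, the per-class error costs from the quote correspond to SVC's class_weight parameter, which rescales the penalty C for each class. A minimal sketch of what I am aiming at (the weight values below are only placeholders, not tuned):

from sklearn.svm import SVC

# class_weight multiplies the error cost C separately for each class, so the
# minority classes can be penalized more heavily and the boundary is pushed
# away from them; these weights are placeholders, not tuned values
biased_clf = SVC(kernel='linear', class_weight={1: 6, 2: 10, 3: 5, 4: 1.5, 5: 1})
# recent scikit-learn versions can also derive the weights from the class
# frequencies with class_weight='balanced' (older versions called it 'auto')

This is the full script I have so far: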

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2,2))
from sklearn.cross_validation import train_test_split, cross_val_score

import pandas as pd
df = pd.read_csv('/path/of/the/file.csv',
                     header=0, sep=',', names=['id', 'text', 'label'])



reduced_data = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values



from sklearn.decomposition.truncated_svd import TruncatedSVD
svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

X_train, X_test, y_train, y_test = train_test_split(reduced_data, y, test_size=0.33)

# with no weights:

from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(reduced_data, y)   # note: fits on the full data, not only on X_train
prediction = clf.predict(X_test)

# separating hyperplane of the unweighted classifier
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]


# get the separating hyperplane using weighted classes
wclf = SVC(kernel='linear', class_weight={1: 10})
wclf.fit(reduced_data, y)
wprediction = wclf.predict(X_test)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot separating hyperplanes and samples
import matplotlib.pyplot as plt
h0 = plt.plot(xx, yy, 'k-', label='no weights')
h1 = plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()

plt.axis('tight')
plt.show()

But I get nothing and I can't understand what happened. This is the plot:

weighted vs normal

then:

# Let's show some metrics [unweighted]:
from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, accuracy_score
print '\nAccuracy:', accuracy_score(y_test, prediction)
print '\nscore:', clf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, prediction, average='weighted')
print '\nprecision:', precision_score(y_test, prediction, average='weighted')
print '\n classification report:\n', classification_report(y_test, prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, prediction)

# Let's show some metrics [weighted]:
print 'weighted:\n'

print '\nAccuracy:', accuracy_score(y_test, wprediction)
print '\nscore:', wclf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, wprediction, average='weighted')
print '\nprecision:', precision_score(y_test, wprediction, average='weighted')
print '\n classification report:\n', classification_report(y_test, wprediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, wprediction)

This is the data I'm using. How can I fix this and plot this problem in the right way? Thanks in advance, guys!

Following an answer to this question, I removed these lines:

#
# from sklearn.decomposition.truncated_svd import TruncatedSVD
# svd = TruncatedSVD(n_components=5)
# reduced_data = svd.fit_transform(reduced_data)


#
# w = clf.coef_[0]
# a = -w[0] / w[1]
# xx = np.linspace(-10, 10)
# yy = a * xx - clf.intercept_[0] / w[1]

# ww = wclf.coef_[0]
# wa = -ww[0] / ww[1]
# wyy = wa * xx - wclf.intercept_[0] / ww[1]
#
# # plot separating hyperplanes and samples
# import matplotlib.pyplot as plt
# h0 = plt.plot(xx, yy, 'k-', label='no weights')
# h1 = plt.plot(xx, wyy, 'k--', label='with weights')
# plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
# plt.legend()
#
# plt.axis('tight')
# plt.show()

These were the results:

Accuracy: 0.787878787879

score: 0.779437105112

recall: 0.787878787879

precision: 0.827705441238

These metrics improved. How can I plot these results in order to have a nice example like the one in the documentation? I would like to see the behavior of the two hyperplanes. Thanks, guys!

asked Feb 12 '15 by tumbleweed



2 Answers

By reducing your data to 5 features with the SVD:

svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

You lose a lot of information. Just by removing those lines I get 78% accuracy.

Leaving the class_weight parameter as you set it seems to do better than removing it. I haven't tried giving it other values.

Look into using k-fold cross validation and grid search to tune the parameters of your model. You can also use a pipeline if you want to reduce the dimensionality of your data, in order to figure out how much you want to reduce it without affecting performance. Here is an example that shows how to tune your entire pipeline using grid search.
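A minimal sketch of that idea, assuming reduced_data is the TF-IDF matrix and y the labels from the question (the module paths below are for recent scikit-learn; older versions used sklearn.grid_search and sklearn.cross_validation, and class_weight='balanced' was called 'auto'):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# dimensionality reduction followed by the classifier, tuned as one unit
pipe = Pipeline([
    ('svd', TruncatedSVD()),
    ('svc', SVC(kernel='linear', class_weight='balanced')),
])

# search over the number of SVD components and the SVM penalty with k-fold CV
param_grid = {
    'svd__n_components': [50, 100, 300],
    'svc__C': [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_weighted')
search.fit(reduced_data, y)
print(search.best_params_, search.best_score_)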

As for plotting, you can only plot 2d or 3d data. After you train using more dimensions, you can reduce your data to 2 or 3 dimensions and plot that. See here for a plotting example. The code looks similar to what you're plotting and I got similar results to yours. The problem is that your data has many features and you can only plot things to a 2d or 3d surface. That will usually make it look weird and hard to tell what is going on.
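For example, a rough sketch of that: train on the full-dimensional data, then project to two SVD components purely for display (X_2d and svd_2d are just illustrative names; the projected scatter will usually look noisy for text data):

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# project the TF-IDF features to 2 components for visualization only
svd_2d = TruncatedSVD(n_components=2)
X_2d = svd_2d.fit_transform(reduced_data)   # reduced_data: the TF-IDF matrix from the question

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('SVD component 1')
plt.ylabel('SVD component 2')
plt.show()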

I suggest you don't bother with plotting as it's not going to tell you much for data in high dimensions. Use k-fold cross validation with a grid search in order to get the best parameters and if you want to look into overfitting closer, plot learning curves instead.
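A sketch of the learning-curve idea with scikit-learn's built-in helper (sklearn.model_selection.learning_curve in recent versions; 0.15-era releases had it under sklearn.learning_curve):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# training and cross-validation score as a function of the training set size;
# a persistent gap between the two curves is a typical sign of overfitting
sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), reduced_data, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(sizes, valid_scores.mean(axis=1), 'o-', label='cross-validation score')
plt.xlabel('training examples')
plt.ylabel('score')
plt.legend(loc='best')
plt.show()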

All this combined will tell you a lot more about the behavior of your model than plotting the hyperplane.

answered Sep 30 '22 by IVlad


If I understood your input correctly you have:

1190 texts labeled 5
1409 texts labeled 1-4

You may try to do a sequential classification. First, treat all 5 labels as 1 and all others as 0, and train a classifier for this task.

Second, drop all the 5-labeled examples from your dataset and train a classifier to distinguish labels 1-4.

At prediction time, run the first classifier; if it returns 0, run the second classifier to obtain the final label.
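A rough sketch of this two-stage scheme, reusing the X_train/y_train/X_test split from the question (the classifier choice and its parameters are just placeholders):

from sklearn.svm import SVC

# stage 1: "is this document labeled 5?" (1 = yes, 0 = no)
stage1 = SVC(kernel='linear')
stage1.fit(X_train, (y_train == 5).astype(int))

# stage 2: trained only on the documents labeled 1-4
mask = y_train != 5
stage2 = SVC(kernel='linear')
stage2.fit(X_train[mask], y_train[mask])

def predict_two_stage(X):
    # if stage 1 says "5", keep that label; otherwise ask stage 2
    labels = stage2.predict(X)
    labels[stage1.predict(X) == 1] = 5
    return labels

prediction = predict_two_stage(X_test)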

That said, I don't think this distribution is really skewed and unbalanced (it would need to be something like 90% label 5 and 10% everything else to be skewed enough for biasing the SVC to be interesting). So you might want to try some other classification algorithm, since it looks like your current choice is not well suited to this task. Or maybe you need to use a different kernel with your SVC (I assume you are using a linear kernel; try something different, such as RBF or polynomial).
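For instance, switching the kernel is a one-line change (the C and gamma values here are arbitrary and would need tuning):

from sklearn.svm import SVC

rbf_clf = SVC(kernel='rbf', C=10, gamma=0.1)  # non-linear kernel, untuned parameters
rbf_clf.fit(X_train, y_train)
print(rbf_clf.score(X_test, y_test))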

answered Sep 30 '22 by Maksim Khaitovich