How to plot SVC classification for an unbalanced dataset with scikit-learn and matplotlib?

I have a text classification task with 2599 documents and five labels from 1 to 5. The documents are distributed over the labels as follows:

label | texts
------+------
5     |  1190
4     |   839
3     |   239
1     |   204
2     |   127

I already classified this textual data, but I got very low performance and also warnings about ill-defined metrics:

Accuracy: 0.461057692308

score: 0.461057692308

precision: 0.212574195636

recall: 0.461057692308

 confusion matrix:
[[  0   0   0   0 153]
 [  0   0   0   0  94]
 [  0   0   0   0 194]
 [  0   0   0   0 680]
 [  0   0   0   0 959]]

 classification report:
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       153
          2       0.00      0.00      0.00        94
          3       0.00      0.00      0.00       194
          4       0.00      0.00      0.00       680
          5       0.46      1.00      0.63       959

avg / total       0.21      0.46      0.29      2080

Clearly this is happening because I have an unbalanced dataset, so I found this paper where the authors propose several approaches to deal with the issue:

The problem is that with imbalanced datasets, the learned boundary is too close to the positive instances. We need to bias SVM in a way that will push the boundary away from the positive instances. Veropoulos et al. [14] suggest using different error costs for the positive (C+) and negative (C-) classes.

I know this could be very complicated, but SVC offers several hyperparameters. So my question is: is there any way to bias SVC so that it pushes the boundary away from the positive instances, using the hyperparameters that the SVC classifier offers? I know this may be a difficult problem, but any help is welcome. Thanks in advance, guys.
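As far as I understand, the per-class error costs from the quote correspond to SVC's class_weight parameter, which rescales the penalty C for each class. A minimal sketch of what I am aiming at (the weight values below are only placeholders, not tuned):

from sklearn.svm import SVC

# class_weight multiplies the error cost C separately for each class, so the
# minority classes can be penalized more heavily and the boundary is pushed
# away from them; these weights are placeholders, not tuned values
biased_clf = SVC(kernel='linear', class_weight={1: 6, 2: 10, 3: 5, 4: 1.5, 5: 1})
# recent scikit-learn versions can also derive the weights from the class
# frequencies with class_weight='balanced' (older versions called it 'auto')

This is the full script I have so far: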

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2,2))
from sklearn.cross_validation import train_test_split, cross_val_score

import pandas as pd
df = pd.read_csv('/path/of/the/file.csv',
                     header=0, sep=',', names=['id', 'text', 'label'])



reduced_data = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values



from sklearn.decomposition.truncated_svd import TruncatedSVD
svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

X_train, X_test, y_train, y_test = train_test_split(reduced_data, y, test_size=0.33)

# with no weights:

from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(reduced_data, y)   # note: fits on the full data, not only on X_train
prediction = clf.predict(X_test)

# separating hyperplane of the unweighted classifier
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]


# get the separating hyperplane using weighted classes
wclf = SVC(kernel='linear', class_weight={1: 10})
wclf.fit(reduced_data, y)
wprediction = wclf.predict(X_test)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot separating hyperplanes and samples
import matplotlib.pyplot as plt
h0 = plt.plot(xx, yy, 'k-', label='no weights')
h1 = plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()

plt.axis('tight')
plt.show()

But I get nothing and I can't understand what happened. This is the plot:

weighted vs normal

then:

# Let's show some metrics [unweighted]:
from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, accuracy_score
print '\nAccuracy:', accuracy_score(y_test, prediction)
print '\nscore:', clf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, prediction, average='weighted')
print '\nprecision:', precision_score(y_test, prediction, average='weighted')
print '\n classification report:\n', classification_report(y_test, prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, prediction)

# Let's show some metrics [weighted]:
print 'weighted:\n'

print '\nAccuracy:', accuracy_score(y_test, wprediction)
print '\nscore:', wclf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, wprediction, average='weighted')
print '\nprecision:', precision_score(y_test, wprediction, average='weighted')
print '\n classification report:\n', classification_report(y_test, wprediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, wprediction)

This is the data I'm using. How can I fix this and plot this problem in the right way? Thanks in advance, guys!

Following an answer to this question, I removed these lines:

#
# from sklearn.decomposition.truncated_svd import TruncatedSVD
# svd = TruncatedSVD(n_components=5)
# reduced_data = svd.fit_transform(reduced_data)


#
# w = clf.coef_[0]
# a = -w[0] / w[1]
# xx = np.linspace(-10, 10)
# yy = a * xx - clf.intercept_[0] / w[1]

# ww = wclf.coef_[0]
# wa = -ww[0] / ww[1]
# wyy = wa * xx - wclf.intercept_[0] / ww[1]
#
# # plot separating hyperplanes and samples
# import matplotlib.pyplot as plt
# h0 = plt.plot(xx, yy, 'k-', label='no weights')
# h1 = plt.plot(xx, wyy, 'k--', label='with weights')
# plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
# plt.legend()
#
# plt.axis('tight')
# plt.show()

These were the results:

Accuracy: 0.787878787879

score: 0.779437105112

recall: 0.787878787879

precision: 0.827705441238

These metrics improved. How can I plot these results in order to have a nice example like the one in the documentation? I would like to see the behavior of the two hyperplanes. Thanks, guys!

asked Feb 12 '15 by tumbleweed



2 Answers

By reducing your data to 5 features with the SVD:

svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

You lose a lot of information. Just by removing those lines I get 78% accuracy.

Leaving the class_weight parameter as you set it seems to do better than removing it. I haven't tried giving it other values.

Look into using k-fold cross validation and grid search to tune the parameters of your model. You can also use a pipeline if you want to reduce the dimensionality of your data, in order to figure out how much you want to reduce it without affecting performance. Here is an example that shows how to tune your entire pipeline using grid search.
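A minimal sketch of that idea, assuming reduced_data is the TF-IDF matrix and y the labels from the question (the module paths below are for recent scikit-learn; older versions used sklearn.grid_search and sklearn.cross_validation, and class_weight='balanced' was called 'auto'):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# dimensionality reduction followed by the classifier, tuned as one unit
pipe = Pipeline([
    ('svd', TruncatedSVD()),
    ('svc', SVC(kernel='linear', class_weight='balanced')),
])

# search over the number of SVD components and the SVM penalty with k-fold CV
param_grid = {
    'svd__n_components': [50, 100, 300],
    'svc__C': [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1_weighted')
search.fit(reduced_data, y)
print(search.best_params_, search.best_score_)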

As for plotting, you can only plot 2d or 3d data. After you train using more dimensions, you can reduce your data to 2 or 3 dimensions and plot that. See here for a plotting example. The code looks similar to what you're plotting and I got similar results to yours. The problem is that your data has many features and you can only plot things to a 2d or 3d surface. That will usually make it look weird and hard to tell what is going on.
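For example, a rough sketch of that: train on the full-dimensional data, then project to two SVD components purely for display (X_2d and svd_2d are just illustrative names; the projected scatter will usually look noisy for text data):

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# project the TF-IDF features to 2 components for visualization only
svd_2d = TruncatedSVD(n_components=2)
X_2d = svd_2d.fit_transform(reduced_data)   # reduced_data: the TF-IDF matrix from the question

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('SVD component 1')
plt.ylabel('SVD component 2')
plt.show()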

I suggest you don't bother with plotting as it's not going to tell you much for data in high dimensions. Use k-fold cross validation with a grid search in order to get the best parameters and if you want to look into overfitting closer, plot learning curves instead.
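A sketch of the learning-curve idea with scikit-learn's built-in helper (sklearn.model_selection.learning_curve in recent versions; 0.15-era releases had it under sklearn.learning_curve):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# training and cross-validation score as a function of the training set size;
# a persistent gap between the two curves is a typical sign of overfitting
sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), reduced_data, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(sizes, valid_scores.mean(axis=1), 'o-', label='cross-validation score')
plt.xlabel('training examples')
plt.ylabel('score')
plt.legend(loc='best')
plt.show()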

All this combined will tell you a lot more about the behavior of your model than plotting the hyperplane.

answered Sep 30 '22 by IVlad


If I understood your input correctly you have:

1190 texts labeled 5
1409 texts labeled 1-4

You may try to do a sequential classification. First, treat all 5 labels as 1 and all others as 0, and train a classifier for this task.

Second, drop all the 5-labeled examples from your dataset and train a classifier to distinguish labels 1-4.

At prediction time, run the first classifier; if it returns 0, run the second classifier to obtain the final label.
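A rough sketch of this two-stage scheme, reusing the X_train/y_train/X_test split from the question (the classifier choice and its parameters are just placeholders):

from sklearn.svm import SVC

# stage 1: "is this document labeled 5?" (1 = yes, 0 = no)
stage1 = SVC(kernel='linear')
stage1.fit(X_train, (y_train == 5).astype(int))

# stage 2: trained only on the documents labeled 1-4
mask = y_train != 5
stage2 = SVC(kernel='linear')
stage2.fit(X_train[mask], y_train[mask])

def predict_two_stage(X):
    # if stage 1 says "5", keep that label; otherwise ask stage 2
    labels = stage2.predict(X)
    labels[stage1.predict(X) == 1] = 5
    return labels

prediction = predict_two_stage(X_test)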

That said, I don't think this distribution is really skewed and unbalanced (it would need to be something like 90% label 5 and 10% everything else to be skewed enough for biasing the SVC to be interesting). So you might want to try some other classification algorithm, since it looks like your current choice is not well suited to this task. Or maybe you need to use a different kernel with your SVC (I assume you are using a linear kernel; try something different, such as RBF or polynomial).
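For instance, switching the kernel is a one-line change (the C and gamma values here are arbitrary and would need tuning):

from sklearn.svm import SVC

rbf_clf = SVC(kernel='rbf', C=10, gamma=0.1)  # non-linear kernel, untuned parameters
rbf_clf.fit(X_train, y_train)
print(rbf_clf.score(X_test, y_test))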

answered Sep 30 '22 by Maksim Khaitovich