I have a text classification task with 2599 documents and five labels from 1 to 5. The documents are distributed as follows:
label | texts
----------
5 |1190
4 |839
3 |239
1 |204
2 |127
I already classified this textual data with very low performance, and I also get warnings about ill-defined metrics:
Accuracy: 0.461057692308
score: 0.461057692308
precision: 0.212574195636
recall: 0.461057692308
'precision', 'predicted', average, warn_for)
confussion matrix:
[[ 0 0 0 0 153]
'precision', 'predicted', average, warn_for)
[ 0 0 0 0 94]
[ 0 0 0 0 194]
[ 0 0 0 0 680]
[ 0 0 0 0 959]]
clasification report:
precision recall f1-score support
1 0.00 0.00 0.00 153
2 0.00 0.00 0.00 94
3 0.00 0.00 0.00 194
4 0.00 0.00 0.00 680
5 0.46 1.00 0.63 959
avg / total 0.21 0.46 0.29 2080
Clearly this is happening because I have an unbalanced dataset, so I found this paper where the authors propose several approaches to deal with this issue:
The problem is that with imbalanced datasets, the learned boundary is too close to the positive instances. We need to bias SVM in a way that will push the boundary away from the positive instances. Veropoulos et al [14] suggest using different error costs for the positive (C +) and negative (C - ) classes
I know this could be very complicated, but SVC offers several hyperparameters. So my question is: is there any way to bias SVC so that it pushes the boundary away from the positive instances, using the hyperparameters that the SVC classifier offers? I know this could be a difficult problem, but any help is welcome. Thanks in advance, guys.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2,2))
from sklearn.cross_validation import train_test_split, cross_val_score
import pandas as pd
df = pd.read_csv('/path/of/the/file.csv',
header=0, sep=',', names=['id', 'text', 'label'])
reduced_data = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values
from sklearn.decomposition.truncated_svd import TruncatedSVD
svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(reduced_data,
y, test_size=0.33)
#with no weights:
from sklearn.svm import SVC
clf = SVC(kernel='linear', class_weight={1: 10})
clf.fit(reduced_data, y)
prediction = clf.predict(X_test)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]
# get the separating hyperplane using weighted classes
wclf = SVC(kernel='linear', class_weight={1: 10})
wclf.fit(reduced_data, y)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]
# plot separating hyperplanes and samples
import matplotlib.pyplot as plt
h0 = plt.plot(xx, yy, 'k-', label='no weights')
h1 = plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()
plt.axis('tight')
plt.show()
But I get nothing, and I can't understand what happened. This is the plot:
then:
#Let's show some metrics[unweighted]:
from sklearn.metrics.metrics import precision_score, \
recall_score, confusion_matrix, classification_report, accuracy_score
print '\nAccuracy:', accuracy_score(y_test, prediction)
print '\nscore:', clf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, prediction)
print '\nprecision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test, prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)
#Let's show some metrics[weighted]:
print 'weighted:\n'
from sklearn.metrics.metrics import precision_score, \
recall_score, confusion_matrix, classification_report, accuracy_score
print '\nAccuracy:', accuracy_score(y_test, prediction)
print '\nscore:', wclf.score(X_train, y_train)
print '\nrecall:', recall_score(y_test, prediction)
print '\nprecision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test, prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)
This is the data that I'm using. How can I fix this and plot this problem the right way? Thanks in advance, guys!
Following an answer to this question, I removed these lines:
#
# from sklearn.decomposition.truncated_svd import TruncatedSVD
# svd = TruncatedSVD(n_components=5)
# reduced_data = svd.fit_transform(reduced_data)
#
# w = clf.coef_[0]
# a = -w[0] / w[1]
# xx = np.linspace(-10, 10)
# yy = a * xx - clf.intercept_[0] / w[1]
# ww = wclf.coef_[0]
# wa = -ww[0] / ww[1]
# wyy = wa * xx - wclf.intercept_[0] / ww[1]
#
# # plot separating hyperplanes and samples
# import matplotlib.pyplot as plt
# h0 = plt.plot(xx, yy, 'k-', label='no weights')
# h1 = plt.plot(xx, wyy, 'k--', label='with weights')
# plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
# plt.legend()
#
# plt.axis('tight')
# plt.show()
These were the results:
Accuracy: 0.787878787879
score: 0.779437105112
recall: 0.787878787879
precision: 0.827705441238
These metrics improved. How can I plot these results in order to get a nice example like the one in the documentation? I would like to see the behavior of the two hyperplanes. Thanks, guys!
SVC, or Support Vector Classifier, is a supervised machine learning algorithm typically used for classification tasks. SVC works by mapping data points to a high-dimensional space and then finding the optimal hyperplane that divides the data into two classes.
Resampling technique: a widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).
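A minimal sketch of random over-sampling, assuming dense feature arrays and scikit-learn's `sklearn.utils.resample`. The helper name `oversample_to_majority` is made up for illustration, and the features are random placeholders; only the class counts come from the question.

```python
import numpy as np
from sklearn.utils import resample


def oversample_to_majority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until every class is as large as the biggest one."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count < target:
            # Sample with replacement so a small class can grow to `target` rows.
            X_cls = resample(X_cls, replace=True, n_samples=target,
                             random_state=random_state)
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Class counts from the question; the feature values are random placeholders.
counts = {5: 1190, 4: 839, 3: 239, 1: 204, 2: 127}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])
X = np.random.RandomState(0).rand(len(y), 3)

X_bal, y_bal = oversample_to_majority(X, y)
balanced_counts = dict(zip(*np.unique(y_bal, return_counts=True)))
print(balanced_counts)
```

If you try this, resample only the training split; over-sampling before the train/test split puts copies of the same document on both sides and inflates the test scores.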
By reducing your data to 5 features with the SVD:
svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)
You lose a lot of information. Just by removing those lines I get 78% accuracy.
Leaving the class_weight parameter as you set it seems to do better than removing it. I haven't tried giving it other values.
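On other values for class_weight: as far as I know, a dict like `{1: 10}` only raises the error cost for label 1 and leaves every other label at weight 1. One alternative worth trying is scikit-learn's "balanced" heuristic, which weights each class inversely to its frequency. A minimal sketch, assuming the class counts from the question and made-up 2D features (not the real TF-IDF data):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the question.
counts = {5: 1190, 4: 839, 3: 239, 1: 204, 2: 127}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])

# "balanced" gives each class the weight n_samples / (n_classes * class_count).
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
for label in sorted(class_weight):
    print(label, round(class_weight[label], 3))

# Made-up features (an assumption for illustration): one noisy blob per label.
rng = np.random.RandomState(0)
X = rng.randn(len(y), 2) * 0.5 + y[:, None]

# Each class's error cost C is multiplied by its weight.
clf = SVC(kernel="linear", class_weight=class_weight).fit(X, y)
predictions = clf.predict(X)
```

Passing `class_weight="balanced"` directly to `SVC` should be equivalent; computing the dict explicitly just makes the per-class costs visible, so you can also put individual weights into a grid search.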
Look into using k-fold cross validation and grid search to tune the parameters of your model. You can also use a pipeline if you want to reduce the dimensionality of your data, in order to figure out how much you want to reduce it without affecting performance. Here is an example that shows how to tune your entire pipeline using grid search.
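A rough sketch of what tuning the whole pipeline could look like, assuming a recent scikit-learn where `GridSearchCV` lives in `sklearn.model_selection`. The tiny corpus, the parameter grid, and the `f1_macro` scoring choice are all made up for illustration; swap in your own texts and labels.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up, imbalanced toy corpus: 30 texts labeled 5 and 12 texts labeled 1.
positive_words = ["great", "excellent", "love", "perfect", "happy", "recommend"]
negative_words = ["bad", "terrible", "hate", "broken", "refund", "awful"]


def make_docs(words, n_docs):
    """Build short documents by cycling through three words from `words`."""
    return [" ".join(words[(i + k) % len(words)] for k in range(3)) + " product"
            for i in range(n_docs)]


texts = make_docs(positive_words, 30) + make_docs(negative_words, 12)
labels = [5] * 30 + [1] * 12

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(random_state=0)),
    ("svc", SVC(kernel="linear")),
])

# Step names prefix the parameter names, so every stage is tuned together.
param_grid = {
    "svd__n_components": [2, 5],
    "svc__C": [0.1, 1, 10],
    "svc__class_weight": [None, "balanced"],
}

# Stratified folds keep the label proportions similar in each split;
# macro-averaged F1 gives the rare labels the same say as the frequent one.
search = GridSearchCV(pipeline, param_grid,
                      cv=StratifiedKFold(n_splits=3), scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer and the SVD sit inside the pipeline, they are refit on each training fold only, so the cross-validated score is not contaminated by the held-out documents.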
As for plotting, you can only plot 2D or 3D data. After training on more dimensions, you can reduce your data to 2 or 3 dimensions and plot that. See here for a plotting example. The code looks similar to what you're plotting, and I got results similar to yours. The problem is that your data has many features and you can only plot onto a 2D or 3D surface, which usually makes the plot look weird and makes it hard to tell what is going on.
I suggest you don't bother with plotting as it's not going to tell you much for data in high dimensions. Use k-fold cross validation with a grid search in order to get the best parameters and if you want to look into overfitting closer, plot learning curves instead.
All this combined will tell you a lot more about the behavior of your model than plotting the hyperplane.
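For the learning curves, a minimal sketch using `sklearn.model_selection.learning_curve`. The data here is synthetic (from `make_classification`), and the classifier settings and `f1_macro` scoring are assumptions for illustration, not your setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.svm import SVC

# Synthetic, imbalanced 3-class data standing in for the real features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=0)

# Score the model on growing portions of the training folds.
train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="linear", class_weight="balanced"), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=StratifiedKFold(n_splits=5),
    scoring="f1_macro",
)

# One row per training size: mean score across the 5 folds.
for n, train, val in zip(train_sizes,
                         train_scores.mean(axis=1),
                         val_scores.mean(axis=1)):
    print("n=%d  train=%.3f  validation=%.3f" % (n, train, val))
```

Plotting the two mean columns against `train_sizes` gives the usual picture: a large, persistent gap between the training and validation scores suggests overfitting, while two low curves that have converged suggest the model is too simple for the data.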
If I understood your input correctly, you have:
1190 texts labeled 5 and 1409 texts labeled 1-4.
You may try a sequential classification. First, treat all 5 labels as 1 and all others as 0, and train a classifier for this task.
Second, drop all 5 examples from your dataset and train a classifier to classify labels 1-4.
At classification time, run the first classifier; if it returns 0, run the second classifier to obtain the final label.
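The steps above could be sketched roughly as follows. The class name `TwoStageClassifier`, the use of a linear `SVC` with `class_weight="balanced"` in both stages, and the toy 2D data are my own assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC


class TwoStageClassifier:
    """Stage 1: majority label vs. the rest. Stage 2: tell the remaining labels apart."""

    def __init__(self, majority_label=5):
        self.majority_label = majority_label

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        is_majority = y == self.majority_label
        # Stage 1 sees every sample: 1 = majority label, 0 = anything else.
        self.stage1_ = SVC(kernel="linear", class_weight="balanced")
        self.stage1_.fit(X, is_majority.astype(int))
        # Stage 2 is trained only on the samples that are not the majority label.
        self.stage2_ = SVC(kernel="linear", class_weight="balanced")
        self.stage2_.fit(X[~is_majority], y[~is_majority])
        return self

    def predict(self, X):
        X = np.asarray(X)
        is_majority = self.stage1_.predict(X) == 1
        labels = np.full(len(X), self.majority_label,
                         dtype=self.stage2_.classes_.dtype)
        rest = ~is_majority
        if rest.any():
            # Only samples rejected by stage 1 are passed to stage 2.
            labels[rest] = self.stage2_.predict(X[rest])
        return labels


# Toy data (made up): one noisy 2D blob per label, with imbalanced counts.
counts = {5: 120, 4: 84, 3: 24, 1: 20, 2: 13}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])
X = np.random.RandomState(0).randn(len(y), 2) * 0.3 + y[:, None]

model = TwoStageClassifier(majority_label=5).fit(X, y)
pred = model.predict(X)
accuracy = float(np.mean(pred == y))
print(accuracy)
```

One thing to keep in mind: any mistake made by the first stage cannot be undone by the second, so it is worth checking the first stage's recall on the non-5 texts separately.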
Though I don't think this distribution is really skewed and unbalanced (it would have to be something like 90% label 5 and 10% everything else to be really skewed, so that it might be interesting to introduce bias to SVC). Thus I think you might want to try some other classification algorithm, since it looks like your choice is not suitable for this task. Or maybe you need to use a different kernel with your SVC (I assume you use a linear kernel; try something different, maybe RBF or polynomial).