I am currently working on large-scale hierarchical text classification of ODP documents. The dataset provided to me is in the libSVM format. I am trying to train a linear-kernel SVM with Python's scikit-learn. Below is a sample of the training data:
29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3
33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1
The following is the code I have used to construct the linear SVM model:
from sklearn.datasets import load_svmlight_file
from sklearn import svm
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
print clf.score(X_test,y_test)
Upon running clf.score(), I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
3 print time.time() - start_time, "seconds"
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
292 """
293 from .metrics import accuracy_score
--> 294 return accuracy_score(y, self.predict(X))
295
296
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
464 Class labels for samples in X.
465 """
--> 466 y = super(BaseSVC, self).predict(X)
467 return self.classes_.take(y.astype(np.int))
468
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
280 y_pred : array, shape (n_samples,)
281 """
--> 282 X = self._validate_for_predict(X)
283 predict = self._sparse_predict if self._sparse else self._dense_predict
284 return predict(X)
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
402 raise ValueError("X.shape[1] = %d should be equal to %d, "
403 "the number of features at training time" %
--> 404 (n_features, self.shape_fit_[1]))
405 return X
406
ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time
Can someone please tell me what exactly is wrong with either this code or the data I have? Thanks in advance.
Below are the values of X_train, y_train, X_test, and y_test:
X_train:
(0, 9453) 1.0
(0, 11741) 1.0
(0, 18883) 14.0
(0, 26839) 1.0
(0, 35146) 1.0
(0, 52781) 1.0
(0, 72082) 1.0
(0, 73243) 1.0
(0, 78944) 1.0
(0, 79912) 1.0
(0, 79985) 1.0
(0, 86709) 3.0
(0, 117285) 1.0
(0, 139819) 1.0
(0, 142457) 1.0
(0, 146314) 1.0
(0, 151004) 2.0
(0, 161453) 3.0
(0, 172236) 1.0
(0, 187531) 2.0
(0, 202462) 1.0
(0, 210417) 1.0
(0, 250581) 1.0
(0, 251689) 1.0
(0, 296384) 2.0
: :
(4462, 735469) 1.0
(4462, 737059) 15.0
(4462, 740127) 1.0
(4462, 743798) 1.0
(4462, 766063) 1.0
(4462, 778958) 2.0
(4462, 784004) 4.0
(4462, 837264) 2.0
(4462, 839095) 22.0
(4462, 844735) 6.0
(4462, 859721) 2.0
(4462, 875267) 1.0
(4462, 910761) 1.0
(4462, 931244) 1.0
(4462, 945069) 6.0
(4462, 948728) 1.0
(4462, 948850) 2.0
(4462, 957682) 1.0
(4462, 975170) 1.0
(4462, 989192) 1.0
(4462, 1014294) 1.0
(4462, 1042424) 1.0
(4462, 1049027) 1.0
(4462, 1072931) 1.0
(4462, 1145790) 1.0
y_train:
[ 2.90000000e+01 3.30000000e+01 3.30000000e+01 ..., 1.65475000e+05
1.65518000e+05 1.65518000e+05]
X_test:
(0, 18573) 1.0
(0, 23501) 1.0
(0, 29954) 1.0
(0, 42112) 1.0
(0, 46402) 1.0
(0, 63041) 2.0
(0, 67942) 2.0
(0, 83522) 1.0
(0, 88413) 2.0
(0, 99454) 1.0
(0, 126041) 1.0
(0, 139819) 1.0
(0, 142678) 1.0
(0, 151004) 1.0
(0, 166351) 2.0
(0, 173794) 1.0
(0, 192162) 3.0
(0, 210417) 2.0
(0, 254468) 1.0
(0, 263895) 2.0
(0, 277567) 1.0
(0, 278419) 2.0
(0, 279181) 2.0
(0, 281319) 2.0
(0, 298898) 1.0
: :
(1857, 1100504) 3.0
(1857, 1103247) 1.0
(1857, 1105578) 1.0
(1857, 1108986) 2.0
(1857, 1118486) 1.0
(1857, 1120807) 9.0
(1857, 1129243) 2.0
(1857, 1131786) 1.0
(1857, 1134029) 2.0
(1857, 1134410) 5.0
(1857, 1134494) 1.0
(1857, 1139045) 25.0
(1857, 1142239) 3.0
(1857, 1142651) 1.0
(1857, 1144787) 1.0
(1857, 1151891) 1.0
(1857, 1152094) 1.0
(1857, 1157533) 1.0
(1857, 1159376) 1.0
(1857, 1178944) 1.0
(1857, 1181310) 2.0
(1857, 1182023) 1.0
(1857, 1187098) 1.0
(1857, 1194344) 2.0
(1857, 1195819) 9.0
y_test:
[ 2.90000000e+01 3.30000000e+01 1.56000000e+02 ..., 1.65434000e+05
1.65475000e+05 1.65518000e+05]
The objective of a Linear SVC (Support Vector Classifier) is to fit to the data you provide, returning a "best fit" hyperplane that divides, or categorizes, your data. From there, after getting the hyperplane, you can then feed some features to your classifier to see what the "predicted" class is.
LinearSVC performs linear support vector classification. It is similar to SVC with the parameter kernel='linear', but it is implemented in terms of liblinear rather than libsvm. That is why it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Support vector machines (SVMs) are supervised machine learning algorithms for classification, regression, and outlier detection that are both powerful and adaptable. The scikit-learn SVMs are commonly used for classification because they remain effective in high-dimensional feature spaces.
The error message
ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time
explains itself: the number of features in the testing data differs from the number of features in the training data that was used to fit the model. That is, X_train.shape[1] is not equal to X_test.shape[1].
You should check why they are not equal, as they should be.
One possibility is that they are loaded as sparse matrices and the number of features is inferred by load_svmlight_file from the highest feature index found in each file. If the testing data contains features unseen in the training data, the resulting X_test might have a larger dimension. To avoid this, you can specify the number of features in load_svmlight_file by passing the argument n_features.
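To make the inference described above concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's actual implementation: the helper name highest_feature_index is hypothetical, the two rows are shortened versions of the sample rows in the question, and it assumes one-based feature indices as in that sample.

```python
def highest_feature_index(lines):
    """Return the largest feature index found in libSVM-formatted lines.

    Hypothetical helper for illustration only. It assumes each line looks
    like "<label> <index>:<value> <index>:<value> ...".
    """
    highest = 0
    for line in lines:
        tokens = line.split()
        for token in tokens[1:]:  # tokens[0] is the class label
            index, _value = token.split(":")
            highest = max(highest, int(index))
    return highest


# Shortened versions of the two sample rows from the question,
# treated here as if they were two separate files.
file_a = ["29 9454:1 11742:1 18884:14 1194345:3"]
file_b = ["33 2474:1 8152:1 19529:2 124939:1"]

width_a = highest_feature_index(file_a)
width_b = highest_feature_index(file_b)
print(width_a, width_b)  # each file on its own implies a different width

# A width that is safe for every file is the maximum over all of them.
shared_width = max(width_a, width_b)
print(shared_width)
```

Because each file is inspected on its own, two files drawn from the same feature space can still end up with different column counts. Agreeing on one width for all files is what removes the mismatch.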
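You can use the n_features option. The value has to cover every feature index that occurs in the file being loaded, so it must come from the wider of the two matrices. In this question the test set is the wider one (1199847 columns versus 1199830), so load it first and reuse its width for the training set:
# Load the wider file first, then force the other file to the same width.
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
X_train, y_train = load_svmlight_file("/path-to-file/train.txt", n_features=X_test.shape[1])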
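This error can also be solved by using load_svmlight_files, which loads several files in one call and constrains all returned matrices to the same number of features: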
from sklearn.datasets import load_svmlight_files
X_train, y_train, X_test, y_test = load_svmlight_files(['/path-to-file/train.txt', '/path-to-file/test.txt'])