Why does the classifier.predict() method expect the number of features in the test data to be the same as in the training data?

Question

I am trying to build a simple SVM document classifier using scikit-learn, and I am using the following code:

import os
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
from sklearn.datasets import load_svmlight_file

clf = svm.SVC()

# Raw strings keep the backslashes from being read as escape sequences (e.g. \a)
path = r"C:\Python27"

f1 = []
f2 = []
data2 = ['omg this is not a ship lol']

f = open(path + r'\mydata\ACQ\acqtot', 'r')
f = f.read()

# Split the file into documents and label each one 'acq'
f1 = f.split(';', 1085)
for i in range(0, 1086):
    f2.append('acq')

# Add a single 'crude' document
f1.append('shipping ship')
f2.append('crude')

vectorizer = TfidfVectorizer(min_df=1)
counter = CountVectorizer(min_df=1)

x_train = vectorizer.fit_transform(f1)
x_test = vectorizer.fit_transform(data2)

num_sample, num_features = x_train.shape
test_sample, test_features = x_test.shape

print("#samples: %d, #features: %d" % (num_sample, num_features))  #samples: 5, #features: 25
print("#samples: %d, #features: %d" % (test_sample, test_features))  #samples: 2, #features: 37

y = ['acq', 'crude']

#print x_test.n_features

clf.fit(x_train, f2)

#den= clf.score(x_test,y)
clf.predict(x_test)

It gives the following error:

(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time

What I don't understand is why it expects the number of features to be the same. If I give the classifier completely new text to predict on, it's obviously not possible for every document to have the same number of features as the data used to train it. Do we have to explicitly set the number of features of the test data to 9451 in this case?

emiguevara · Accepted Answer

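To ensure that the training and test data share the same feature representation, you should not fit_transform your test data, but only transform it. fit_transform learns a new vocabulary from whatever text it is given, so refitting on the test documents produces a different feature space (6 features here instead of 9451). transform reuses the vocabulary learned from the training data, and words that were not seen during training are ignored.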

x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.transform(data2)
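
A minimal sketch of the difference, using a tiny made-up corpus (the documents and labels below are illustrative assumptions, not the asker's data). It compares the column count produced by transform with the one produced by refitting a fresh vectorizer on the test text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus, for illustration only.
train_docs = [
    "company acquires rival in stock deal",
    "merger and acquisition talks continue",
    "crude oil shipping by tanker ship",
    "oil prices rise as crude exports fall",
]
train_labels = ["acq", "acq", "crude", "crude"]
test_docs = ["omg this is not a ship lol"]

vectorizer = TfidfVectorizer(min_df=1)

# Learn the vocabulary from the training text and vectorize it.
x_train = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary: same columns, unseen words are dropped.
x_test = vectorizer.transform(test_docs)

# Refitting on the test text builds a different, much smaller vocabulary.
x_refit = TfidfVectorizer(min_df=1).fit_transform(test_docs)

print(x_train.shape, x_test.shape, x_refit.shape)

# The classifier accepts x_test because its width matches x_train.
clf = SVC(kernel="linear")
clf.fit(x_train, train_labels)
prediction = clf.predict(x_test)
print(prediction)
```

Passing x_refit to clf.predict instead of x_test would raise the same kind of ValueError as in the question, because its column count no longer matches the training matrix.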

The same principle applies to your labels: any encoding should be learned from the training labels and then reused unchanged for the test labels.
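
One possible reading of this advice, sketched with scikit-learn's LabelEncoder (the label lists below are made up for illustration). Note that scikit-learn classifiers such as SVC accept string labels directly, so this step is optional; it mainly matters when you encode labels yourself:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists, for illustration only.
train_labels = ["acq", "acq", "crude", "crude"]
test_labels = ["crude", "acq"]

encoder = LabelEncoder()

# Learn the label-to-integer mapping from the training labels only.
y_train = encoder.fit_transform(train_labels)

# Apply the same mapping to the test labels (no refitting).
y_test = encoder.transform(test_labels)

# Map integer codes back to the original label strings.
decoded = encoder.inverse_transform(y_test)

print(list(encoder.classes_), list(y_train), list(y_test), list(decoded))
```

As with the vectorizer, calling fit_transform again on the test labels could assign different integers to the same class names, which would make the scores meaningless.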

Tags: python, machine-learning, svm, scikit-learn

finitenessofinfinity
