
ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time

UPDATED

I am working on text classification with machine learning, using an SVC with a linear kernel. All of the code works except the last line, print (svm_model_linear.predict_proba(test)). I am building a classifier with three categories (cycling, football and badminton), and I have some Facebook statuses that are labeled with these categories. I have trained the classifier and tested it using train_test_split. I also have some statuses which are not labeled that I want to classify, but the last line of code gives me an error.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 700)
X = cv.fit_transform(corpus).toarray()
print(X)
y = dataset.iloc[:, 1].values
print(y)

# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)


from sklearn.svm import SVC
svm_model_linear = SVC(kernel = 'linear', C = 1, probability = True).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)



# Model accuracy on X_test
accuracy = svm_model_linear.score(X_test, y_test)
# Creating a confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, svm_predictions)

Classification of the unlabeled data starts here:

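import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

data = pd.read_csv('sentence.csv', delimiter = '\t', quoting = 3)

# Clean, stem and strip stopwords from the first five unlabeled statuses
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
test = []
for j in range(0, 5):
    review = re.sub('[^a-zA-Z]', ' ', data['Sentence'][j])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    test.append(review)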
pred = cv.fit_transform(test).toarray()
print (svm_model_linear.predict_proba(test))

Error

print (svm_model_linear.predict_proba(test))

Traceback (most recent call last):

  File "<ipython-input-7-5fa676a0fc00>", line 1, in <module>
print (svm_model_linear.predict_proba(test))

  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 594, in _predict_proba
X = self._validate_for_predict(X)

  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 439, in _validate_for_predict
X = check_array(X, accept_sparse='csr', dtype=np.float64, order="C")

  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 402, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time
Dexter asked Nov 03 '17 11:11



1 Answer

Scikit-learn estimators don't work on strings, only on numerical data. Your training part completes successfully because you converted the corpus from strings to numbers using CountVectorizer. You are not doing that for the test data.

You need to call cv.transform(test) on your test data so that it has the same form as the X that was used to train the model. Only then will the prediction succeed and be meaningful.

Also make sure that you use the same cv object with which you transformed your original training corpus into numerical form.
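
To see why the same fitted cv object matters, here is a minimal sketch that compares column counts. The documents below are hypothetical toy data, not the asker's corpus, and a fresh CountVectorizer stands in for the re-fitting case so that the fitted cv is left untouched.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the training corpus and new statuses.
train_docs = ["cycling uphill today", "football match tonight", "badminton doubles practice"]
new_docs = ["watching football", "new cycling route"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_docs)  # learns the vocabulary from the training text
X_new = cv.transform(new_docs)          # maps new text onto that same vocabulary

# Fitting a vectorizer on the new text alone builds a different, smaller vocabulary.
refit = CountVectorizer().fit_transform(new_docs)

print(X_train.shape, X_new.shape, refit.shape)
```

The transformed matrix has the same number of columns as the training matrix, while the re-fitted one does not, which is the mismatch the ValueError reports.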

Update:

Don't call fit_transform() on test data; only ever call transform(), as advised above. What you are currently doing is:

pred = cv.fit_transform(test).toarray()

which discards the vocabulary learned during training and re-fits the CountVectorizer, changing the shape of pred. Change it to the line below, and pass pred (not the raw test list of strings) to predict_proba:

pred = cv.transform(test).toarray()
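
Put together, the corrected flow might look like the sketch below. The training statuses, labels and unlabeled statuses are made-up placeholders (assumptions, not the asker's data), and with so few samples the probabilities themselves are not meaningful; the point is only that the shapes line up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical labeled statuses (placeholders for the asker's dataset).
train_docs = [
    "cycling uphill this morning", "bought a new cycling helmet",
    "long cycling ride with friends", "cycling race on sunday",
    "football match tonight", "great football goal yesterday",
    "football training in the rain", "watching the football final",
    "badminton doubles practice", "new badminton racket arrived",
    "badminton tournament this weekend", "playing badminton indoors",
]
train_labels = ["cycling"] * 4 + ["football"] * 4 + ["badminton"] * 4

# Hypothetical unlabeled statuses to classify.
unlabeled = ["football final tonight", "long cycling ride"]

# Fit the vectorizer once, on the training text only.
cv = CountVectorizer(max_features=700)
X_train = cv.fit_transform(train_docs).toarray()

model = SVC(kernel="linear", C=1, probability=True, random_state=0)
model.fit(X_train, train_labels)

# Reuse the fitted vectorizer: transform(), not fit_transform().
pred = cv.transform(unlabeled).toarray()
proba = model.predict_proba(pred)  # pass the numeric matrix, not the raw strings

print(model.classes_)
print(proba)
```

Each row of proba holds one probability per class, in the order given by model.classes_.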
Vivek Kumar answered Sep 25 '22 21:09
