UPDATED
I am working on text classification with machine learning, using an SVC with a linear kernel. The whole code works except the last line, print (svm_model_linear.predict_proba(test)). I am building a classifier with three categories: cycling, football, and badminton. I have some Facebook statuses labeled with these categories; I trained the classifier on them and tested it using train_test_split. I also have some statuses that are not labeled, which I want to classify, but the last line of code gives me an error.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 700)
X = cv.fit_transform(corpus).toarray()
print(X)
y = dataset.iloc[:, 1].values
print(y)
# Splitting the dataset into the Training set and Test set
# (in scikit-learn 0.18 and later this lives in sklearn.model_selection)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
svm_model_linear = SVC(kernel = 'linear', C = 1, probability = True).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)
# model accuracy for X_test
accuracy = svm_model_linear.score(X_test, y_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, svm_predictions)
Classification of the unlabeled data starts here:
data = pd.read_csv('sentence.csv', delimiter = '\t', quoting = 3)
test = []
for j in range(0, 5):
    review = re.sub('[^a-zA-Z]', ' ', data['Sentence'][j])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
    review = ' '.join(review)
    test.append(review)
pred = cv.fit_transform(test).toarray()
print (svm_model_linear.predict_proba(test))
Error:
print (svm_model_linear.predict_proba(test))
Traceback (most recent call last):
File "<ipython-input-7-5fa676a0fc00>", line 1, in <module>
print (svm_model_linear.predict_proba(test))
File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 594, in _predict_proba
X = self._validate_for_predict(X)
File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 439, in _validate_for_predict
X = check_array(X, accept_sparse='csr', dtype=np.float64, order="C")
File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 402, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time
Scikit-learn estimators don't work on strings, only on numerical data. Your training step completes successfully because you converted the corpus from strings to numerical features using CountVectorizer. You are not doing that for the test data.
You need to call cv.transform(test) on your test data to give it the same form as X, which was used to train the model. Only then will the prediction succeed and be meaningful.
Also make sure that you use the same cv object with which you transformed your original training corpus to numerical form.
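A minimal sketch of that idea, assuming two small made-up corpora (train_corpus and new_statuses are hypothetical stand-ins for the labeled and unlabeled statuses). The vectorizer learns its vocabulary once, from the training text, and the same object then converts the new text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, not the asker's data.
train_corpus = ["love cycling uphill", "football match tonight", "badminton racket new"]
new_statuses = ["cycling tonight", "new football boots"]

cv = CountVectorizer(max_features=700)
X_train = cv.fit_transform(train_corpus).toarray()  # learns the vocabulary from the training text
X_new = cv.transform(new_statuses).toarray()        # reuses that vocabulary, learns nothing new

print(X_train.shape)
print(X_new.shape)
```

Both arrays have the same number of columns because words that were never seen during fitting (here "boots") are simply ignored by transform().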
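Update:
Don't call fit_transform() on test data; always call only transform(), as I advised above. What you are currently doing is:
pred = cv.fit_transform(test).toarray()
which discards the previous training and re-fits the count vectorizer on the test sentences alone, so pred ends up with a different number of columns (15 instead of the 700 the model was trained on). Change it to:
pred = cv.transform(test).toarray()
Then pass pred, not the raw list test, to predict_proba, since the model expects the numerical array.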
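To make the difference concrete, here is a rough end-to-end sketch under stated assumptions: the corpus, labels, and test sentences are invented toy data (already cleaned and stemmed), and a separate vectorizer is used for the re-fit so that the trained cv is left intact.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy data: three cleaned statuses per category, repeated
# so the classifier has a few more samples to work with.
corpus = [
    "ride bike hill", "cycl race road bike", "bike ride morn",
    "football goal match", "watch football leagu", "goal penalti kick",
    "badminton racket shuttl", "play badminton court", "shuttl smash net",
] * 3
labels = (["cycling"] * 3 + ["football"] * 3 + ["badminton"] * 3) * 3

cv = CountVectorizer(max_features=700)
X = cv.fit_transform(corpus).toarray()
model = SVC(kernel="linear", C=1, probability=True, random_state=0).fit(X, labels)

test = ["bike ride road", "football match goal kick"]

# Re-fitting on the test sentences builds a new, smaller vocabulary,
# so the column count no longer matches the training matrix.
refit = CountVectorizer(max_features=700).fit_transform(test).toarray()
print(refit.shape[1], X.shape[1])

# transform() with the already-fitted vectorizer keeps the training columns.
pred = cv.transform(test).toarray()
proba = model.predict_proba(pred)  # pass the numeric array, not the strings

print(model.classes_)
print(proba)
```

Each row of proba holds one probability per class, in the order given by model.classes_. With so little data the probabilities themselves are not meaningful; the point is only that the shapes line up.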