I'm trying to do a k-nearest-neighbors prediction on some letter recognition data I found in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Letter+Recognition).
I cross validated the data and tested for accuracy with no issues but I can't run the classifier.predict(). Can anyone shed light on why I'm getting this error? I read up on the curse of dimensionality on the sklearn site but I'm having trouble actually fixing my code.
My code so far is as follows:
import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors
df = pd.read_csv('KMeans_letter_recog.csv')
X = np.array(df.drop(['Letter'], 1))
y = np.array(df['Letter'])
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size = 0.2) #20% data used
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test) #test
print(accuracy) #this works fine
example = np.array([7,4,3,2,4,5,3,6,7,4,2,3,5,6,8,4])
example = X.reshape(len(example), -1)
prediction = clf.predict(example)
print(prediction) #error
df.head() produces:
Letter x-box y-box box_width box_height on_pix x-bar_mean \
0 T 2 8 3 5 1 8
1 I 5 12 3 7 2 10
2 D 4 11 6 8 6 10
3 N 7 11 6 6 3 5
4 G 2 1 3 1 1 8
y-bar_mean x2bar_mean y2bar_mean xybar_mean x2y_mean xy2_mean \
0 13 0 6 6 10 8
1 5 5 4 13 3 9
2 6 2 6 10 3 7
3 9 4 6 4 4 10
4 6 6 6 6 5 9
x-ege xegvy y-ege yegvx
0 0 8 0 8
1 2 8 4 10
2 3 7 3 9
3 6 10 2 8
4 1 7 5 10
The error output is as follows:
Traceback (most recent call last):
File "C:\Users\jai_j\Desktop\Python Projects\K Means ML.py", line 31, in <module>
prediction = clf.predict(example)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\neighbors\classification.py", line 145, in predict
neigh_dist, neigh_ind = self.kneighbors(X)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\neighbors\base.py", line 381, in kneighbors
for s in gen_even_slices(X.shape[0], n_jobs)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
self.results = batch()
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "sklearn\neighbors\binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn\neighbors\kd_tree.c:11325)
ValueError: query data dimension must match training data dimension
Thank you in advance for any help. I'll keep searching for an answer in the meantime.
There are two problems: you are reshaping X rather than example, and you are reshaping to the wrong dimensions. X.reshape(len(example), -1) turns your X array into shape (16, N), where N is the number of observations in X.
As a result, when you call clf.predict(example), the classifier receives X reshaped to have N columns instead of the 16 columns it was trained on, which is what raises the ValueError.
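To see the shape mismatch concretely, here is a minimal sketch using only NumPy. The array of zeros and the row count of 100 are placeholders I made up to stand in for the real data; only the shapes matter.

```python
import numpy as np

# Placeholder data (assumption): 100 observations with 16 features each
n_rows, n_features = 100, 16
X = np.zeros((n_rows, n_features))
example = np.arange(n_features)          # a single sample, shape (16,)

# What the question's code does: reshapes X, not example
wrong = X.reshape(len(example), -1)      # shape (16, 100): 100 "features" per row

# What the classifier needs: one row with 16 features
right = example.reshape(1, -1)           # shape (1, 16)

print(X.shape, wrong.shape, right.shape)
```

The second dimension is what the classifier compares against its training data, so only the (1, 16) array matches.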
It seems you want to predict on your single example, so you should reshape it instead of X. Presumably, you want example = example.reshape(1, -1) instead of example = X.reshape(len(example), -1).
Initially, you create example with shape (16,). Reshaping it with (1, -1) as the dimensions gives an array of shape (1, 16), that is, one sample with 16 features, which fits your classifier.
To be clear, try changing your code to this:
example = np.array([7,4,3,2,4,5,3,6,7,4,2,3,5,6,8,4])
example = example.reshape(1, -1)
prediction = clf.predict(example)
print(prediction) # shouldn't error anymore
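If you want to check the fix without the CSV, the following is a rough, self-contained sketch. It makes a few assumptions that are not in the question: the data is random (so the predicted label is meaningless), the labels are limited to four letters, and it imports train_test_split from sklearn.model_selection, since the sklearn.cross_validation module used in the question was removed in scikit-learn 0.20.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the letter data (assumption): 200 rows, 16 integer features
rng = np.random.default_rng(0)
X = rng.integers(0, 16, size=(200, 16))
y = rng.choice(list("ABCD"), size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

# Correct: one sample, 16 features
example = np.array([7, 4, 3, 2, 4, 5, 3, 6, 7, 4, 2, 3, 5, 6, 8, 4])
example = example.reshape(1, -1)
prediction = clf.predict(example)
print(prediction)

# The original mistake: X reshaped to (16, 200) no longer has 16 features
try:
    clf.predict(X.reshape(16, -1))
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

The exact wording of the ValueError depends on the scikit-learn version, so it may differ from the message in the traceback above.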