
KNearest Neighbors in sklearn - ValueError: query data dimension must match training data dimension

I'm trying to do a k-nearest neighbors prediction on some letter recognition data I found in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Letter+Recognition).

I split the data into training and test sets and checked the accuracy with no issues, but I can't run clf.predict(). Can anyone shed light on why I'm getting this error? I read up on the curse of dimensionality on the sklearn site, but I'm having trouble actually fixing my code.

My code so far is as follows:

import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors

df = pd.read_csv('KMeans_letter_recog.csv')    

X = np.array(df.drop(['Letter'], 1))
y = np.array(df['Letter'])

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size = 0.2) #20% data used

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test) #test
print(accuracy) #this works fine

example = np.array([7,4,3,2,4,5,3,6,7,4,2,3,5,6,8,4])
example = X.reshape(len(example), -1)

prediction = clf.predict(example)
print(prediction) #error

df.head() produces:

 Letter   x-box   y-box   box_width   box_height   on_pix   x-bar_mean  \
0      T       2       8           3            5        1            8   
1      I       5      12           3            7        2           10   
2      D       4      11           6            8        6           10   
3      N       7      11           6            6        3            5   
4      G       2       1           3            1        1            8   

    y-bar_mean   x2bar_mean   y2bar_mean   xybar_mean   x2y_mean   xy2_mean  \
0           13            0            6            6         10          8   
1            5            5            4           13          3          9   
2            6            2            6           10          3          7   
3            9            4            6            4          4         10   
4            6            6            6            6          5          9   

    x-ege   xegvy   y-ege   yegvx  
0       0       8       0       8  
1       2       8       4      10  
2       3       7       3       9  
3       6      10       2       8  
4       1       7       5      10  

The error output is as follows:

Traceback (most recent call last):
  File "C:\Users\jai_j\Desktop\Python Projects\K Means ML.py", line 31, in <module>
    prediction = clf.predict(example)
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\neighbors\classification.py", line 145, in predict
    neigh_dist, neigh_ind = self.kneighbors(X)
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\neighbors\base.py", line 381, in kneighbors
    for s in gen_even_slices(X.shape[0], n_jobs)
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
    self.results = batch()
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "sklearn\neighbors\binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn\neighbors\kd_tree.c:11325)
ValueError: query data dimension must match training data dimension

Thank you in advance for any help. I'll keep searching for an answer in the meantime.

Jai Mahtani asked Oct 29 '22 14:10



1 Answer

There are two problems: you are reshaping X rather than example, and you are reshaping to the wrong dimensions. X.reshape(len(example), -1) turns your X array into shape (16, N), where N is the number of observations in X.

As a result, when you call clf.predict(example), you are asking the classifier to predict on X reshaped to have N columns, rather than the 16 columns it was trained on. That mismatch is what raises the ValueError.

It seems you want to predict on your single example, so you should reshape it instead of X. Presumably, you want example = example.reshape(1, -1) instead of example = X.reshape(len(example), -1).

Initially, example has shape (16,). Reshaping it with (1, -1) gives an array of shape (1, 16), that is, one sample with 16 features, which is the 2D layout your classifier expects.
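To make the shapes concrete, here is a small NumPy-only sketch. The zero-filled X is a stand-in for your real feature matrix, and the row count of 20000 is an assumption about the size of the dataset; only the shapes matter here.

```python
import numpy as np

# Stand-in for the real feature matrix (assumed size: 20000 rows, 16 features).
n_samples, n_features = 20000, 16
X = np.zeros((n_samples, n_features))

example = np.array([7, 4, 3, 2, 4, 5, 3, 6, 7, 4, 2, 3, 5, 6, 8, 4])

# What the original code does: reshapes X (not example) into 16 rows.
wrong = X.reshape(len(example), -1)

# What was intended: one row (one sample) with 16 columns (features).
right = example.reshape(1, -1)

print(example.shape)  # (16,)
print(wrong.shape)    # (16, 20000)
print(right.shape)    # (1, 16)
```

The number of columns in whatever you pass to predict() has to equal the number of columns in the training data, so only the last shape is usable.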

To be clear, try changing your code to this:

example = np.array([7,4,3,2,4,5,3,6,7,4,2,3,5,6,8,4])
example = example.reshape(1, -1)

prediction = clf.predict(example)
print(prediction) # shouldn't error anymore
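
If you want to try the fix without the CSV file, the following is a rough end-to-end sketch on synthetic data. The random features and the three made-up labels "A", "B", "C" are assumptions purely for illustration, so the predicted letter itself is meaningless. It also imports train_test_split from sklearn.model_selection, since the sklearn.cross_validation module used in the question was removed in scikit-learn 0.20.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 samples, 16 integer features, 3 made-up labels.
rng = np.random.RandomState(0)
X = rng.randint(0, 16, size=(200, 16))
y = rng.choice(list("ABC"), size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

# One sample with 16 features, reshaped to 2D: shape (1, 16).
example = np.array([7, 4, 3, 2, 4, 5, 3, 6, 7, 4, 2, 3, 5, 6, 8, 4])
example = example.reshape(1, -1)

# Cheap guard: the query must have as many columns as the training data.
assert example.shape[1] == X_train.shape[1]

prediction = clf.predict(example)
print(prediction)  # an array holding one label
```

A guard like the assert above fails early with a clear location, which can be easier to read than the traceback from inside the neighbors search.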
Nick Becker answered Nov 15 '22 09:11
