scikit-learn query data dimension must match training data dimension

I'm trying to use this code from the scikit-learn site:

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

I'm using my own data, which has a lot more than two features. When I expand the number of features from 2 to 3 or 4, I get:

"query data dimension must match training data dimension"

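import csv

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Assumption: module paths as in the scikit-learn release current in 2015.
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.lda import LDA
from sklearn.qda import QDA

# Read the tab-separated file, skipping the header row and rows with empty cells.
liste = []
with open("test.txt", 'r') as csvr:
    reader = csv.reader(csvr, delimiter='\t')
    for i, row in enumerate(reader):
        if i == 0 or '' in row[2:]:
            continue
        liste.append(map(float, row[2:]))

a = np.array(liste)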
h = .02  # step size of the mesh grid
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "LDA", "QDA"]
classifiers = [
    KNeighborsClassifier(1),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA()]



X = a[:, :3]            # first three columns are the features
y = np.ravel(a[:, 13])  # column 13 holds the class labels

linearly_separable = (X, y)
datasets =[linearly_separable]
figure = plt.figure(figsize=(27, 9))
i = 1

for ds in datasets:
    X, y = ds

    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)

    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)

    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        print clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        print y.shape, X.shape
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
            print Z
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]


        Z = Z.reshape(xx.shape)

        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)

        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

figure.subplots_adjust(left=.02, right=.98)
plt.show()

In this case I use three features. What am I doing wrong in the code? Is it something to do with the X_train and X_test data? With just two features, everything works.

My X and y values:

(array([[ 1.,  1.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  1.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  0.,  0.],
   [ 3.,  3.,  0.],
   [ 1.,  1.,  0.],
   [ 1.,  1.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 4.,  4.,  2.],
   [ 0.,  0.,  0.],
   [ 6.,  3.,  0.],
   [ 5.,  3.,  2.],
   [ 2.,  2.,  0.],
   [ 4.,  4.,  2.],
   [ 2.,  1.,  0.],
   [ 2.,  2.,  0.]]), array([ 1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,
    1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  1.]))

The first array is the X array and the second array is the y (target) array.

Sorry for the bad formatting. Here is the error:

Traceback (most recent call last):
  File "allM.py", line 144, in <module>
    mainplot(namePlot,1,2)
  File "allM.py", line 117, in mainplot
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/classification.py", line 191, in predict_proba
    neigh_dist, neigh_ind = self.kneighbors(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 332, in kneighbors
    return_distance=return_distance)
  File "binary_tree.pxi", line 1298, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn/neighbors/kd_tree.c:10433)
ValueError: query data dimension must match training data dimension

And this is the X array before putting it into the dataset "ds":

[[ 1.  1.  0.][ 1.  0.  0.][ 1.  0.  0.][ 1.  0.  0.][ 1.  1.  0.][ 1.  0.  0.][ 1.  0.  0.][ 3.  3.  0.][ 1.  1.  0.][ 1.  1.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 4.  4.  2.][ 0.  0.  0.][ 6.  3.  0.][ 5.  3.  2.][ 2.  2.  0.][ 4.  4.  2.][ 2.  1.  0.][ 2.  2.  0.]]

asked Apr 29 '15 15:04

auronsen

1 Answer

This happens because clf.predict_proba() (like clf.decision_function()) requires an array in which each row has the same number of elements as the rows of the training data; in other words, an input with shape (num_rows, 3).
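
As a rough illustration of that requirement, the sketch below fits a nearest-neighbors classifier on three-feature data and then queries it with three-column and two-column arrays. The data is random and purely hypothetical, and the exact wording of the error message depends on the scikit-learn version installed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(20, 3)         # hypothetical data: 20 samples, 3 features
y_train = np.array([0, 1] * 10)   # two balanced classes

clf = KNeighborsClassifier(1).fit(X_train, y_train)

good_query = rng.rand(5, 3)       # same number of columns as X_train
bad_query = rng.rand(5, 2)        # one column short

# One row per query point, one column per class.
proba = clf.predict_proba(good_query)
print(proba.shape)

# A query with the wrong number of columns is rejected with a ValueError.
try:
    clf.predict_proba(bad_query)
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)
```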

With two features this worked because the result of np.c_[xx.ravel(), yy.ravel()] is an array with two-element rows:

print np.c_[xx.ravel(), yy.ravel()].shape
(45738, 2)
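
The same shape check can be reproduced on a tiny grid, which makes it easier to see where the two columns come from. This is only a sketch with made-up axis ranges and a coarse step; the column count of the result equals the number of arrays handed to np.c_.

```python
import numpy as np

h = 0.5  # coarse step, chosen only to keep the example small
xx, yy = np.meshgrid(np.arange(0, 2, h),   # 4 points along the first feature
                     np.arange(0, 3, h))   # 6 points along the second feature

# Flatten both coordinate arrays and stack them side by side as columns.
grid2 = np.c_[xx.ravel(), yy.ravel()]

print(xx.shape)     # (6, 4): one entry per grid cell
print(grid2.shape)  # (24, 2): one row per grid cell, one column per feature
print(grid2[1])     # second grid point: x = 0.5, y = 0.0
```

Because every row of the stacked array corresponds to one cell of xx, the classifier output can later be reshaped back with Z.reshape(xx.shape).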

The rows have two elements because they come from np.meshgrid, which the sample code uses to build a set of inputs covering a two-dimensional space that plots nicely. Try passing an array with three-item rows to clf.predict_proba and things should work fine.
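
One way to get three-item rows while keeping a 2D plot is to leave the grid over the first two features and hold the third feature at a fixed value. This is a different technique from the one suggested in this answer, shown only as a sketch: the data and labels are synthetic, and fixing the third feature at its mean is an assumption, so the resulting surface is just one slice through the 3D decision function.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.randn(40, 3)                      # stand-in for the scaled 3-feature data
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels, for illustration only

clf = KNeighborsClassifier(3).fit(X, y)

h = 0.25
xx, yy = np.meshgrid(np.arange(X[:, 0].min() - .5, X[:, 0].max() + .5, h),
                     np.arange(X[:, 1].min() - .5, X[:, 1].max() + .5, h))

# Hold the third feature at its mean for every grid point (an assumption).
fill = np.full(xx.size, X[:, 2].mean())
grid = np.c_[xx.ravel(), yy.ravel(), fill]

# The query now has three columns, so the classifier accepts it.
Z = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
print(grid.shape, Z.shape)
```

Z has the same shape as xx, so it could be passed to ax.contourf(xx, yy, Z, ...) exactly as in the sample code.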

If you want to reproduce this specific piece of sample code, you'll have to create a 3D meshgrid, as described in this question on SO. You'll also have to plot the results in 3D, for which mplot3d is a good starting point. Based on the (admittedly brief) look I gave the plotting in the sample code, though, I suspect this may be more trouble than it's worth. I'm not really sure what a 3D analog of those plots would even look like.
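
A 3D meshgrid could be built along these lines. This is a minimal sketch with a made-up axis range and a coarse step, not code from the linked question; the last lines show how quickly the number of query points grows at the h = .02 step used in the sample code.

```python
import numpy as np

h = 0.5
axis = np.arange(-1.0, 1.0, h)             # 4 points per axis

# Passing three axes to np.meshgrid yields three 3D coordinate arrays.
xx, yy, zz = np.meshgrid(axis, axis, axis)
grid3 = np.c_[xx.ravel(), yy.ravel(), zz.ravel()]
print(grid3.shape)                         # (64, 3): 4 * 4 * 4 rows, 3 columns

# The row count grows with the cube of the per-axis resolution.
per_axis_fine = len(np.arange(-1.0, 1.0, 0.02))
print(per_axis_fine ** 3)
```

An array like grid3 has the three columns a classifier trained on three features expects, though evaluating it at a fine step may be slow for some of the classifiers in the comparison.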

answered Nov 06 '22 21:11

mattsilver


Tags: python, machine-learning, scikit-learn
