
scikit-learn: query data dimension must match training data dimension

I'm trying to use this code from the scikit-learn site:

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

I'm using my own data, which has many more than two features. When I try to "expand" the features from 2 to 3 or 4, the code fails.

I'm getting:

"query data dimension must match training data dimension"

def machine():
    with open("test.txt", 'r') as csvr:
        reader = csv.reader(csvr, delimiter='\t')
        for i, row in enumerate(reader):
            if i == 0:
                pass
            elif '' in row[2:]:
                pass
            else:
                liste.append(map(float, row[2:]))

a = np.array(liste)
h = .02 
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "LDA", "QDA"]
classifiers = [
    KNeighborsClassifier(1),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA()]



X = a[:,:3]
y = np.ravel(a[:,13])

linearly_separable = (X, y)
datasets =[linearly_separable]
figure = plt.figure(figsize=(27, 9))
i = 1

for ds in datasets:
    X, y = ds

    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)

    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)

    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        print clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        print y.shape, X.shape
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
            print Z
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]


        Z = Z.reshape(xx.shape)

        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)

        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

figure.subplots_adjust(left=.02, right=.98)
plt.show()

In this case I use three features. What am I doing wrong in the code? Is it something to do with the X_train and X_test data? With just two features, everything works.

My X and y values:

(array([[ 1.,  1.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  1.,  0.],
   [ 1.,  0.,  0.],
   [ 1.,  0.,  0.],
   [ 3.,  3.,  0.],
   [ 1.,  1.,  0.],
   [ 1.,  1.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 0.,  0.,  0.],
   [ 4.,  4.,  2.],
   [ 0.,  0.,  0.],
   [ 6.,  3.,  0.],
   [ 5.,  3.,  2.],
   [ 2.,  2.,  0.],
   [ 4.,  4.,  2.],
   [ 2.,  1.,  0.],
   [ 2.,  2.,  0.]]), array([ 1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,
    1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  1.]))

The first array is the X array and the second array is the y(target) array.

Sorry for the bad formatting. Here is the error:

Traceback (most recent call last):
  File "allM.py", line 144, in <module>
    mainplot(namePlot,1,2)
  File "allM.py", line 117, in mainplot
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/classification.py", line 191, in predict_proba
    neigh_dist, neigh_ind = self.kneighbors(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 332, in kneighbors
    return_distance=return_distance)
  File "binary_tree.pxi", line 1298, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn/neighbors/kd_tree.c:10433)
ValueError: query data dimension must match training data dimension

And this is the X array before it is put into the dataset "ds":

[[ 1.  1.  0.][ 1.  0.  0.][ 1.  0.  0.][ 1.  0.  0.][ 1.  1.  0.][ 1.  0.  0.][ 1.  0.  0.][ 3.  3.  0.][ 1.  1.  0.][ 1.  1.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 0.  0.  0.][ 4.  4.  2.][ 0.  0.  0.][ 6.  3.  0.][ 5.  3.  2.][ 2.  2.  0.][ 4.  4.  2.][ 2.  1.  0.][ 2.  2.  0.]]
auronsen asked Apr 29 '15 15:04



1 Answer

This is happening because clf.predict_proba() requires an array in which each row has the same number of elements as the rows of the training data. In other words, with three features it needs an input of shape (num_rows, 3).
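A quick way to see the mismatch is to compare the column counts of the two arrays. This is only a sketch: X_train here is a zero-filled stand-in for the asker's data, the grid uses a coarser step than the question's h = .02, and same_feature_count is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical stand-in for the training data: 24 samples, 3 features.
X_train = np.zeros((24, 3))

# Same grid construction as in the question, with a coarser step.
xx, yy = np.meshgrid(np.arange(-1.0, 1.0, 0.5),
                     np.arange(-1.0, 1.0, 0.5))
grid = np.c_[xx.ravel(), yy.ravel()]


def same_feature_count(train, query):
    """Return True when both 2-D arrays have the same number of columns."""
    return train.shape[1] == query.shape[1]


print(X_train.shape)                      # (24, 3)
print(grid.shape)                         # (16, 2)
print(same_feature_count(X_train, grid))  # False
```

The grid has two columns while the training data has three, which is exactly the condition the classifier rejects.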

When you were working with two-dimensional exemplars this worked because the result of np.c_[xx.ravel(), yy.ravel()] is an array with two-element rows:

print np.c_[xx.ravel(), yy.ravel()].shape
(45738, 2)

These exemplars have two elements because they're created by np.meshgrid, which the sample code uses to build a set of inputs covering a two-dimensional space that plots nicely. Try passing an array with three-item rows to clf.predict_proba and things should work fine.
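One possible way to get three-item rows while keeping the 2D plot is to append a constant third column to the grid. Note this is my own assumption rather than something the sample code does: it holds the third feature at its mean, so the contour shows only a single 2D slice of the decision surface. The data below is random and purely illustrative.

```python
import numpy as np

# Hypothetical training data with three features (random, for illustration).
rng = np.random.RandomState(0)
X = rng.randn(24, 3)

h = 0.5  # coarser than the question's .02 to keep the grid small
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Assumption: fix the third feature at its mean for every grid point.
third = np.full(xx.size, X[:, 2].mean())
grid3 = np.c_[xx.ravel(), yy.ravel(), third]

print(grid3.shape[1] == X.shape[1])  # True
```

With a grid like this, clf.predict_proba(grid3)[:, 1] returns one value per grid point, so the existing Z.reshape(xx.shape) step still applies.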

If you want to reproduce this specific piece of sample code, you'll have to create a 3D meshgrid, as described in this question on SO. You'll also have to plot the results in 3D, for which mplot3d is a good starting point. Based on the (admittedly brief) look I gave to the plotting in the sample code, though, I suspect this may be more trouble than it's worth. I'm not really sure what a 3D analog of those plots would even look like.
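For reference, a 3D grid can be built by passing three axes to np.meshgrid. This is a minimal sketch with hypothetical axis ranges and a coarse step, not a drop-in replacement for the plotting code.

```python
import numpy as np

# Hypothetical axis range; a coarse step keeps the grid small.
axis = np.arange(-1.0, 1.0, 0.5)  # 4 values per axis
xx, yy, zz = np.meshgrid(axis, axis, axis)

# One row per grid point, one column per feature.
grid = np.c_[xx.ravel(), yy.ravel(), zz.ravel()]
print(grid.shape)  # (64, 3)
```

Keep in mind that the number of points grows with the cube of the points per axis, so a step as fine as the .02 used in the 2D sample will produce a very large array.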

mattsilver answered Nov 06 '22 21:11