I'm trying to use this code from the scikit-learn site:
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
I'm using my own data, which has many more than two features. When I try to expand the features from 2 to 3 or 4, I get:
"query data dimension must match training data dimension"
    def machine():
        with open("test.txt",'r') as csvr:
            reader= csv.reader(csvr,delimiter='\t')
            for i,row in enumerate(reader):
                if i==0:
                    pass
                elif '' in row[2:]:
                    pass
                else:
                    liste.append(map(float,row[2:]))
        a = np.array(liste)
        h = .02
        names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
                 "Random Forest", "AdaBoost", "Naive Bayes", "LDA", "QDA"]
        classifiers = [
            KNeighborsClassifier(1),
            SVC(kernel="linear", C=0.025),
            SVC(gamma=2, C=1),
            DecisionTreeClassifier(max_depth=5),
            RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
            AdaBoostClassifier(),
            GaussianNB(),
            LDA(),
            QDA()]
        X = a[:,:3]
        y = np.ravel(a[:,13])
        linearly_separable = (X, y)
        datasets =[linearly_separable]
        figure = plt.figure(figsize=(27, 9))
        i = 1
        for ds in datasets:
            X, y = ds
            X = StandardScaler().fit_transform(X)
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
            x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
            y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
            xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                                 np.arange(y_min, y_max, h))
            cm = plt.cm.RdBu
            cm_bright = ListedColormap(['#FF0000', '#0000FF'])
            ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
            ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
            ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            i += 1
            for name, clf in zip(names, classifiers):
                ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
                print clf.fit(X_train, y_train)
                score = clf.score(X_test, y_test)
                print y.shape, X.shape
                if hasattr(clf, "decision_function"):
                    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
                    print Z
                else:
                    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
                Z = Z.reshape(xx.shape)
                ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
                ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
                ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                           alpha=0.6)
                ax.set_xlim(xx.min(), xx.max())
                ax.set_ylim(yy.min(), yy.max())
                ax.set_xticks(())
                ax.set_yticks(())
                ax.set_title(name)
                ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                        size=15, horizontalalignment='right')
                i += 1
        figure.subplots_adjust(left=.02, right=.98)
        plt.show()
In this case I use three features. What am I doing wrong in the code? Is it something with the X_train and X_test data? With just two features, everything works.
My X value:
(array([[ 1., 1., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.],
[ 1., 0., 0.],
[ 1., 0., 0.],
[ 3., 3., 0.],
[ 1., 1., 0.],
[ 1., 1., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 4., 4., 2.],
[ 0., 0., 0.],
[ 6., 3., 0.],
[ 5., 3., 2.],
[ 2., 2., 0.],
[ 4., 4., 2.],
[ 2., 1., 0.],
[ 2., 2., 0.]]), array([ 1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1.,
1., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1.]))
The first array is the X array and the second array is the y(target) array.
Sorry for the bad formatting. Here is the error:
    Traceback (most recent call last):
      File "allM.py", line 144, in <module>
        mainplot(namePlot,1,2)
      File "allM.py", line 117, in mainplot
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
      File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/classification.py", line 191, in predict_proba
        neigh_dist, neigh_ind = self.kneighbors(X)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 332, in kneighbors
        return_distance=return_distance)
      File "binary_tree.pxi", line 1298, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn/neighbors/kd_tree.c:10433)
    ValueError: query data dimension must match training data dimension
And this is the X array before it is put into the dataset "ds":
[[ 1. 1. 0.][ 1. 0. 0.][ 1. 0. 0.][ 1. 0. 0.][ 1. 1. 0.][ 1. 0. 0.][ 1. 0. 0.][ 3. 3. 0.][ 1. 1. 0.][ 1. 1. 0.][ 0. 0. 0.][ 0. 0. 0.][ 0. 0. 0.][ 0. 0. 0.][ 0. 0. 0.][ 0. 0. 0.][ 4. 4. 2.][ 0. 0. 0.][ 6. 3. 0.][ 5. 3. 2.][ 2. 2. 0.][ 4. 4. 2.][ 2. 1. 0.][ 2. 2. 0.]]
This is happening because clf.predict_proba() requires an array where each row has the same number of elements as the rows in the training data; in other words, an input with shape (num_rows, 3).
When you were working with two-dimensional exemplars this worked because the result of np.c_[xx.ravel(), yy.ravel()] is an array with two-element rows:
print np.c_[xx.ravel(), yy.ravel()].shape
(45738, 2)
These exemplars have two elements because they're created by np.meshgrid, which the sample code uses to create a set of inputs covering a two-dimensional space that will plot nicely. Try passing an array with three-item rows to clf.predict_proba and things should work fine.
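One possible way to build such an array while keeping the 2D plot is sketched below. It assumes that viewing a single slice of the decision surface is acceptable: the grid still spans the first two features, and the third column is padded with a constant (0 here, which is the mean of a feature after StandardScaler). The toy data, the alternating labels, and the names `third` and `grid` are made up for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 24 samples with 3 (already standardized) features.
rng = np.random.RandomState(0)
X_train = rng.randn(24, 3)
y_train = np.arange(24) % 2  # alternating labels so both classes are present

clf = KNeighborsClassifier(1).fit(X_train, y_train)

h = 0.25  # coarser than the .02 in the sample code, to keep the grid small
xx, yy = np.meshgrid(np.arange(-3, 3, h), np.arange(-3, 3, h))

# Pad the 2D grid with a constant third feature so every row has 3 elements,
# matching the number of features the classifier was trained on.
third = np.zeros(xx.size)
grid = np.c_[xx.ravel(), yy.ravel(), third]

# The query now has shape (num_rows, 3), so the dimension check passes.
Z = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
print(grid.shape)
print(Z.shape)
```

The resulting Z has the same shape as xx, so it can go to ax.contourf(xx, yy, Z, ...) as in the sample code. Keep in mind that the plot then shows only the slice where the third feature equals the chosen constant.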
If you want to reproduce this specific piece of sample code, you'll have to create a 3D meshgrid, as described in this question on SO. You'll also have to plot the results in 3D, where mplot3d will serve as a good starting point. Based on the (admittedly brief) look I gave to the plotting in the sample code, though, I suspect this may be more trouble than it's worth. I'm not really sure what a 3D analog of those plots would even look like.
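As a rough sketch of the 3D meshgrid idea, the snippet below builds a grid over three features using only NumPy. The range, the step size, and the names `axis` and `grid3d` are assumptions chosen for illustration; in the real code each axis would span the min and max of the corresponding feature.

```python
import numpy as np

h = 0.5  # a coarse step; a 3D grid grows cubically with resolution
axis = np.arange(-1.0, 1.0, h)  # hypothetical range shared by all features

# np.meshgrid accepts any number of 1D arrays; three of them give a 3D grid.
xx, yy, zz = np.meshgrid(axis, axis, axis)

# One row per grid point, one column per feature: shape (n_points, 3).
grid3d = np.c_[xx.ravel(), yy.ravel(), zz.ravel()]
print(grid3d.shape)
```

An array like grid3d can be passed to clf.predict_proba or clf.decision_function for a classifier trained on three features. The values it returns would need a 3D visualization rather than contourf, and with h = .02 as in the sample code the number of grid points gets very large.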