
Label Propagation in sklearn is classifying every vector as 1

I have 2000 labelled samples (7 different labels) and about 100K unlabelled samples, and I am trying to use sklearn.semi_supervised.LabelPropagation. The data has 1024 dimensions. My problem is that the classifier labels everything as 1. My code looks like this:

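import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Use only the first 10,000 unlabeled samples
X_unlabeled = X_unlabeled[:10000, :]
X_both = np.vstack((X_train, X_unlabeled))
# Unlabeled samples are marked with the label -1
y_both = np.append(y_train, -np.ones((X_unlabeled.shape[0],)))
clf = LabelPropagation(max_iter=100).fit(X_both, y_both)
y_pred = clf.predict(X_test)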

y_pred is all ones. Also, X_train is 2000x1024 and X_unlabeled is a subset of the unlabeled data which is 10000x1024.

I also get this warning when calling fit on the classifier:

/usr/local/lib/python2.7/site-packages/sklearn/semi_supervised/label_propagation.py:255: RuntimeWarning: invalid value encountered in divide
  self.label_distributions_ /= normalizer

Andrew Danks asked Nov 02 '22 10:11



1 Answer

Have you tried different values for the gamma parameter? The graph is built by computing an RBF kernel, so the computation includes an exponential, and Python's exponential functions return 0 when the argument is a negative number of too large a magnitude (see http://computer-programming-forum.com/56-python/ef71e144330ffbc2.htm). If the graph is filled with zeros, the normalization divides zero by zero, so label_distributions_ is filled with NaN and the warning appears. Be careful: in the scikit-learn implementation, gamma multiplies the squared Euclidean distance, i.e. the kernel is exp(-gamma * ||x - y||^2), which is not the same parameterization as in the Zhu paper.
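
To make the underflow concrete, here is a minimal, dependency-free sketch. It assumes unscaled standard-normal features as hypothetical stand-in data (not the asker's data) and uses gamma = 20, the scikit-learn default for LabelPropagation. The "one over the median squared distance" rule for picking a smaller gamma is a common heuristic that I am assuming here as a starting point; it is not something the library does for you.

```python
import math
import random
import statistics


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


random.seed(0)
dim = 1024  # same dimensionality as in the question

# Hypothetical stand-in data: five unscaled standard-normal vectors.
points = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]

# Squared distances between all distinct pairs (roughly 2 * dim for this data).
sq_dists = [
    squared_distance(points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]

# RBF weights exp(-gamma * d^2) with the default gamma: every one underflows to 0.0.
default_gamma = 20.0
weights_default = [math.exp(-default_gamma * d) for d in sq_dists]

# Assumed heuristic: scale gamma so that gamma * d^2 is about 1 for a typical pair.
heuristic_gamma = 1.0 / statistics.median(sq_dists)
weights_scaled = [math.exp(-heuristic_gamma * d) for d in sq_dists]

print("default gamma:  ", max(weights_default))
print("heuristic gamma:", heuristic_gamma, min(weights_scaled))
```

A value found this way could then be tried as LabelPropagation(gamma=heuristic_gamma, max_iter=100), ideally alongside a few neighbouring values, since the best gamma depends on the scale of your features.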

chloe answered Nov 08 '22 06:11
