I am trying to build a string kernel to feed a support vector classifier. I tried it with a function that calculates the kernel, something like this:
def stringkernel(K, G):
    # R was never initialized in the original snippet
    R = np.zeros((len(K), len(G)))
    for a in range(len(K)):
        for b in range(len(G)):
            R[a][b] = np.exp(editdistance(K[a], G[b]) ** 2)
    return R
When I pass it to SVC as the kernel parameter, I get:
clf = svm.SVC(kernel=stringkernel)
clf.fit(data, target)
ValueError: could not convert string to float: photography
where my data is a list of strings and the target is the corresponding class each string belongs to. I have reviewed some questions on Stack Overflow regarding this issue, but I think a bag-of-words representation is not appropriate for this case.
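This is a limitation in scikit-learn that has proved hard to get rid of: the input is converted to a numeric array before the kernel function is ever called, which is where the ValueError comes from. You can try this workaround. Represent the strings as feature vectors with only one feature, which is really just an index into the table of strings.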
>>> import numpy as np
>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])
Redefine the string kernel function to work on this representation:
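>>> def string_kernel(X, Y):
...     R = np.zeros((len(X), len(Y)))
...     for a, x in enumerate(X):
...         for b, y in enumerate(Y):
...             # the single feature is an index into the table of strings
...             i = int(x[0])
...             j = int(y[0])
...             # simplest kernel ever: 1 if the first characters match
...             R[a, b] = data[i][0] == data[j][0]
...     return R
...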
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)
The downside to this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.
>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'],
      dtype='|S3')
(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)
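The two-table idea could look roughly like the sketch below. The names train_data, new_data and lookup are hypothetical and not part of scikit-learn; the kernel is the same toy first-character comparison used above, and indices at or beyond len(train_data) are assumed to refer to the table of new samples.

```python
import numpy as np

train_data = ["foo", "bar", "baz"]   # strings seen during fit
new_data = ["bla", "fool"]           # strings to classify later (hypothetical table)

def lookup(index):
    """Map a pseudo-feature to its string: indices below len(train_data)
    refer to the training table, the rest to the table of new samples."""
    index = int(index)
    n_train = len(train_data)
    if index < n_train:
        return train_data[index]
    return new_data[index - n_train]

def string_kernel(X, Y):
    """Toy kernel: 1.0 if the first characters of the two strings match."""
    R = np.zeros((len(X), len(Y)))
    for a, x in enumerate(X):
        for b, y in enumerate(Y):
            R[a, b] = lookup(x[0])[0] == lookup(y[0])[0]
    return R

# Pseudo-features: training indices start at 0, new ones at len(train_data).
X_train = np.arange(len(train_data)).reshape(-1, 1)
X_new = np.arange(len(new_data)).reshape(-1, 1) + len(train_data)

# Rows are new samples, columns are training samples.
K_new = string_kernel(X_new, X_train)
```

The training table is never modified here; only new_data changes between calls to predict, which is the point of the second table.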
This is an ugly hack, but it works (it's slightly less ugly for clustering because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I say a patch to fix this properly is welcome.
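The question's editdistance helper is not shown. As a minimal sketch, assuming it means Levenshtein distance, the per-pair similarity could be computed as below. The names edit_distance, string_similarity and gamma are illustrative. Note one deliberate change, which is an assumption about the intent: the question's snippet exponentiates the squared distance with a positive sign, which grows with distance, whereas this sketch uses a negative exponent so that similar strings score close to 1.

```python
import math

def edit_distance(s, t):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def string_similarity(s, t, gamma=0.1):
    """exp(-gamma * d**2), which equals 1.0 for identical strings."""
    return math.exp(-gamma * edit_distance(s, t) ** 2)
```

Inside a kernel function that works on the index representation, the matrix entry for a pair of rows would then be string_similarity applied to the two strings looked up from the table. A similarity built this way is not guaranteed to be positive semi-definite, so the SVM may behave less predictably than with a proven kernel.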