 

How to use string kernels in scikit-learn?

I am trying to build a string kernel that feeds a support vector classifier. I tried it with a function that computes the kernel, something like this:

import numpy as np

def stringkernel(K, G):
    # editdistance is assumed to return the edit distance between two strings
    R = np.zeros((len(K), len(G)))
    for a in range(len(K)):
        for b in range(len(G)):
            # minus sign so that similarity decreases with distance
            R[a][b] = np.exp(-editdistance(K[a], G[b]) ** 2)
    return R

And when I pass it to SVC as a parameter I get

 clf = svm.SVC(kernel=stringkernel)
 clf.fit(data, target)

 ValueError: could not convert string to float: photography

where my data is a list of strings and the target is the corresponding class each string belongs to. I have reviewed some questions on Stack Overflow regarding this issue, but I think a bag-of-words representation is not appropriate for this case.
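For completeness, here is a self-contained sketch of the kind of kernel the snippet above is aiming at, assuming `editdistance` means the Levenshtein distance (that assumption, the `gamma` parameter, and the function names are mine, not from the question). Note that exponentiating a negated edit distance is not guaranteed to produce a positive semidefinite kernel, so treat this as a heuristic:

```python
import numpy as np

def levenshtein(s, t):
    """Plain dynamic-programming edit distance."""
    m, n = len(s), len(t)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[m, n]

def string_gram(K, G, gamma=0.1):
    """RBF-style similarity matrix: larger distance -> smaller kernel value."""
    R = np.zeros((len(K), len(G)))
    for a, s in enumerate(K):
        for b, t in enumerate(G):
            R[a, b] = np.exp(-gamma * levenshtein(s, t) ** 2)
    return R
```

Identical strings get value 1.0 on the diagonal, and increasingly different strings decay toward 0.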

asked Mar 19 '23 by ssierral
1 Answer

This is a limitation in scikit-learn that has proved hard to get rid of. You can try this workaround: represent each string as a feature vector with a single feature, which is really just an index into a table of strings.

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])

Redefine the string kernel function to work on this representation:

>>> def string_kernel(X, Y):
...     R = np.zeros((len(X), len(Y)))
...     for i, x in enumerate(X):
...         for j, y in enumerate(Y):
...             # the single pseudo-feature is an index into the string table
...             a = data[int(x[0])]
...             b = data[int(y[0])]
...             # simplest kernel ever: do the first characters match?
...             R[i, j] = float(a[0] == b[0])
...     return R
... 
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)

The downside to this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.

>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'], 
      dtype='|S3')

(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)

This is an ugly hack, but it works (it's slightly less ugly for clustering because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I say a patch to fix this properly is welcome.
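A related workaround worth sketching: scikit-learn's `SVC` also accepts `kernel="precomputed"`, in which case you pass Gram matrices instead of feature vectors and skip the index trick entirely. This is only a sketch, reusing the toy first-character kernel from the session above; the helper names `gram` and `k` are mine:

```python
import numpy as np
from sklearn.svm import SVC

# toy kernel between two strings: 1 if their first characters match, else 0
def k(s, t):
    return float(s[0] == t[0])

def gram(A, B):
    return np.array([[k(a, b) for b in B] for a in A])

train = ["foo", "bar", "baz"]
y = ["no", "yes", "yes"]

clf = SVC(kernel="precomputed")
clf.fit(gram(train, train), y)        # n_train x n_train Gram matrix

test = ["bla", "fool"]
# for prediction, rows are test samples and columns are *training* samples
pred = clf.predict(gram(test, train))
```

The prediction step still needs the training strings around to build the test-vs-train Gram matrix, so it is not free of bookkeeping either, but the kernel function itself stays a plain function of strings.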

answered Mar 21 '23 by Fred Foo