 

How to use string kernels in scikit-learn?

I am trying to build a string kernel that feeds a support vector classifier. I tried it with a function that computes the kernel, something like this:

import numpy as np

def stringkernel(K, G):
    # editdistance is assumed to return the edit distance between two strings
    R = np.zeros((len(K), len(G)))
    for a in range(len(K)):
        for b in range(len(G)):
            # minus sign so that similarity decreases with distance
            R[a][b] = np.exp(-editdistance(K[a], G[b]) ** 2)
    return R

And when I pass it to SVC as a parameter I get

 clf = svm.SVC(kernel=stringkernel)
 clf.fit(data, target)

 ValueError: could not convert string to float: photography

where my data is a list of strings and the target is the corresponding class each string belongs to. I have reviewed some questions on Stack Overflow regarding this issue, but I think a bag-of-words representation is not appropriate for this case.
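For completeness, here is a self-contained sketch of the kind of kernel the snippet above is aiming at, assuming `editdistance` means the Levenshtein distance (that assumption, the `gamma` parameter, and the function names are mine, not from the question). Note that exponentiating a negated edit distance is not guaranteed to produce a positive semidefinite kernel, so treat this as a heuristic:

```python
import numpy as np

def levenshtein(s, t):
    """Plain dynamic-programming edit distance."""
    m, n = len(s), len(t)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[m, n]

def string_gram(K, G, gamma=0.1):
    """RBF-style similarity matrix: larger distance -> smaller kernel value."""
    R = np.zeros((len(K), len(G)))
    for a, s in enumerate(K):
        for b, t in enumerate(G):
            R[a, b] = np.exp(-gamma * levenshtein(s, t) ** 2)
    return R
```

Identical strings get value 1.0 on the diagonal, and increasingly different strings decay toward 0.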

asked Mar 19 '23 by ssierral
1 Answer

This is a limitation in scikit-learn that has proved hard to get rid of. You can try this workaround: represent each string as a feature vector with a single feature, which is really just an index into a table of strings.

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])

Redefine the string kernel function to work on this representation:

>>> def string_kernel(X, Y):
...     R = np.zeros((len(X), len(Y)))
...     for i, x in enumerate(X):
...         for j, y in enumerate(Y):
...             # the single pseudo-feature is an index into the string table
...             a = data[int(x[0])]
...             b = data[int(y[0])]
...             # simplest kernel ever: do the first characters match?
...             R[i, j] = float(a[0] == b[0])
...     return R
... 
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)

The downside to this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.

>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'], 
      dtype='|S3')

(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)

This is an ugly hack, but it works (it's slightly less ugly for clustering because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I say a patch to fix this properly is welcome.
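A related workaround worth sketching: scikit-learn's `SVC` also accepts `kernel="precomputed"`, in which case you pass Gram matrices instead of feature vectors and skip the index trick entirely. This is only a sketch, reusing the toy first-character kernel from the session above; the helper names `gram` and `k` are mine:

```python
import numpy as np
from sklearn.svm import SVC

# toy kernel between two strings: 1 if their first characters match, else 0
def k(s, t):
    return float(s[0] == t[0])

def gram(A, B):
    return np.array([[k(a, b) for b in B] for a in A])

train = ["foo", "bar", "baz"]
y = ["no", "yes", "yes"]

clf = SVC(kernel="precomputed")
clf.fit(gram(train, train), y)        # n_train x n_train Gram matrix

test = ["bla", "fool"]
# for prediction, rows are test samples and columns are *training* samples
pred = clf.predict(gram(test, train))
```

The prediction step still needs the training strings around to build the test-vs-train Gram matrix, so it is not free of bookkeeping either, but the kernel function itself stays a plain function of strings.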

answered Mar 21 '23 by Fred Foo