Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String Kernels in R

I have been using the stringdot function available in R in "Kernlab" package.Here is my code

library(kernlab)
x <- c("1","2","3")
y <- c("3","2","1")
lst <- list(x, y)
sk <- stringdot(length = 2, lambda = 1.2, type = "exponential", normalized = TRUE)
q <- kernelMatrix(sk,lst)

As per my knowledge, the exponential kernel will create substrings of length 2. For example, here the strings will be 1-2,1-3,2-3 from the first vector and 3-2,3-1,2-1 from the second vector. It will try to match the input by creating various substrings of the given length and reducing the weight of the substrings as per the given value of lambda.

As per my expectations, the output should contain a value of 1 for (x,x) and (y,y) and a value 0 for (x,y), as there are no common substrings between the given inputs, but the output is showing the value of (x,y) pair to be 0.4723.

I don't understand why the similarity between x and y is 0.4723.

like image 273
does_it_matter Avatar asked Mar 17 '17 11:03

does_it_matter


1 Answers

By looking at the source of kernelMatrix and stringdot, it's possible to work out what's happening with your input.

When a list is passed as x to kernelMatrix, it does the following:

if (is(x, "list")) 
  x <- sapply(x, paste, collapse = "")

In your case, this means that your lst input becomes c("123", "321").

kernelMatrix then takes this vector, and generates a matrix with the following pattern (where sk is the stringkernel function):

sk("123", "123")    sk("123", "321")
                    sk("321", "321")

The lower left cell is then filled in with the upper right, and the whole matrix is normalised by dividing by the square-root of the upper-left cell multiplied by the lower right.

You can check the individual value matches by doing this:

stringdot(type = "exponential", lambda = 1.2)(123, 321)
#[1] 0.4723893

It's worth noting that the length parameter does nothing for type = "exponential". Each type of stringkernel only has zero or one parameters, and for exponential it's lambda which gives the decay. The substring weight decays as the matching substring gets shorter, with lambda as the decay factor.

stringdot(type = "spectrum"), as described in the manual, uses length and will only match subsequences of at least that length. Since there is no matching substring >= 2 characters between 123 and 321, the comparison comes out as zero.

It should also be noted that a newline character ("\n") is appended to each string, and that even single character matches will have a result >0 using type = "exponential", so it's not possible to get a result of zero. e.g.

stringdot(type = "exponential", lambda = 1.2)("blowfish", "mage")
#[1] 0.05274495

Finally, it looks as though @Rahul wanted an implementation in R of Lodhi's 2002 algorithm. kernlab does not implement this algorithm, and I'm not aware of an R package that does. There seems to be a python implementation available at github, but I've not checked whether the code runs let alone gives the output requested. It would be possible for someone with an interest to reimplement the python code in R if that were felt useful/necessary.

Additional edit to address comment:

Results of normalised stringkernel functions depend on the degree to which each string is similar to itself.

sk_u <- stringdot(type = "exponential", lambda = 1.2, normalized = FALSE)
sk_n <- stringdot(type = "exponential", lambda = 1.2, normalized = TRUE)

lapply(list(unnormalised = sk_u, normalised = sk_n), function(f) {
  c(
    "ab,xyzabqr" = f("ab", "xyzabqr"),
    "ab,abpmnop" = f("ab", "abpmnop"),
    "ab,ab" = f("ab"),
    "xyzabqr,xyzabqr" = f("xyzabqr"),
    "abpmnop,abpmnop" = f("abpmnop")
  )
})

#$unnormalised
#     ab,xyzabqr      ab,abpmnop           ab,ab xyzabqr,xyzabqr abpmnop,abpmnop 
#       3.194444        3.194444        4.467593       20.814201       22.480868 

#$normalised
#     ab,xyzabqr      ab,abpmnop           ab,ab xyzabqr,xyzabqr abpmnop,abpmnop 
#      0.3312674       0.3187513       1.0000000       1.0000000       1.0000000 

It can be seen that unnormalised, the result is the same for either comparison. However, since the normalised result is equal to (e.g.) sk_u("ab", "xyzabqr") / sqrt(sk_u("ab") * sk_u("xyzabqr")), the reason that sk_n("ab", "xyzabqr") scores higher relates to the fact that there are two p's in "abpmnop".

like image 159
Nick Kennedy Avatar answered Nov 17 '22 20:11

Nick Kennedy