I have been using the stringdot function available in R's kernlab package. Here is my code:
library(kernlab)
x <- c("1","2","3")
y <- c("3","2","1")
lst <- list(x, y)
sk <- stringdot(length = 2, lambda = 1.2, type = "exponential", normalized = TRUE)
q <- kernelMatrix(sk, lst)
As per my knowledge, the exponential kernel will create substrings of length 2. For example, here the substrings will be 1-2, 1-3, 2-3 from the first vector and 3-2, 3-1, 2-1 from the second vector. It will try to match the inputs by creating various substrings of the given length, reducing the weight of each substring according to the given value of lambda.
As per my expectation, the output should contain a value of 1 for (x, x) and (y, y) and a value of 0 for (x, y), since there are no common substrings between the given inputs. However, the output shows the value of the (x, y) pair to be 0.4723, and I don't understand why the similarity between x and y is 0.4723.
By looking at the source of kernelMatrix and stringdot, it's possible to work out what's happening with your input. When a list is passed as x to kernelMatrix, it does the following:
if (is(x, "list"))
x <- sapply(x, paste, collapse = "")
In your case, this means that your lst input becomes c("123", "321").
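You can check this collapsing step directly (a quick aside; this is plain base R, nothing kernlab-specific):
sapply(list(c("1", "2", "3"), c("3", "2", "1")), paste, collapse = "")
#[1] "123" "321"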
kernelMatrix then takes this vector and generates a matrix with the following pattern (where sk is the stringkernel function):
sk("123", "123") sk("123", "321")
sk("321", "321")
The lower-left cell is then filled in with the upper-right value, and each cell is normalised by dividing by the square root of the product of the corresponding diagonal cells; for the off-diagonal entry, that is the square root of the upper-left cell multiplied by the lower-right.
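As a quick sketch of that normalisation (building an unnormalised kernel with normalized = FALSE is my own device here; the same trick appears later in this answer), the off-diagonal value can be reproduced by hand:
sk_u <- stringdot(type = "exponential", lambda = 1.2, normalized = FALSE)
# off-diagonal entry divided by the geometric mean of the diagonal entries;
# this should match the normalised value shown below, 0.4723893
sk_u("123", "321") / sqrt(sk_u("123", "123") * sk_u("321", "321"))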
You can check the individual value matches by doing this:
stringdot(type = "exponential", lambda = 1.2)(123, 321)
#[1] 0.4723893
It's worth noting that the length parameter does nothing for type = "exponential". Each type of stringkernel takes either zero or one parameter, and for exponential it's lambda, which gives the decay: the weight of a matching substring decays as the substring gets shorter, with lambda as the decay factor.
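A small illustration (these particular calls are mine, not part of the original output): changing length leaves the exponential kernel's result unchanged, while changing lambda does not.
stringdot(type = "exponential", lambda = 1.2, length = 2)("123", "321")
stringdot(type = "exponential", lambda = 1.2, length = 5)("123", "321")
# both of the above should print 0.4723893; only lambda alters the result:
stringdot(type = "exponential", lambda = 2)("123", "321")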
stringdot(type = "spectrum")
, as described in the manual, uses length
and will only match subsequences of at least that length. Since there is no matching substring >= 2 characters between 123
and 321
, the comparison comes out as zero.
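For instance (my own quick check):
stringdot(type = "spectrum", length = 2)("123", "321")
# expected to be 0: "123" and "321" share no substring of length >= 2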
It should also be noted that a newline character ("\n") is appended to each string, and that even single-character matches give a result > 0 with type = "exponential", so it's not possible to get a result of zero. e.g.
stringdot(type = "exponential", lambda = 1.2)("blowfish", "mage")
#[1] 0.05274495
Finally, it looks as though @Rahul wanted an R implementation of Lodhi's 2002 algorithm. kernlab does not implement this algorithm, and I'm not aware of an R package that does. There seems to be a Python implementation available on GitHub, but I've not checked whether the code runs, let alone whether it gives the requested output. It would be possible for someone with an interest to reimplement the Python code in R if that were felt useful or necessary.
Results of normalised stringkernel functions depend on the degree to which each string is similar to itself.
sk_u <- stringdot(type = "exponential", lambda = 1.2, normalized = FALSE)
sk_n <- stringdot(type = "exponential", lambda = 1.2, normalized = TRUE)
lapply(list(unnormalised = sk_u, normalised = sk_n), function(f) {
  c(
    "ab,xyzabqr" = f("ab", "xyzabqr"),
    "ab,abpmnop" = f("ab", "abpmnop"),
    "ab,ab" = f("ab"),
    "xyzabqr,xyzabqr" = f("xyzabqr"),
    "abpmnop,abpmnop" = f("abpmnop")
  )
})
#$unnormalised
# ab,xyzabqr ab,abpmnop ab,ab xyzabqr,xyzabqr abpmnop,abpmnop
# 3.194444 3.194444 4.467593 20.814201 22.480868
#$normalised
# ab,xyzabqr ab,abpmnop ab,ab xyzabqr,xyzabqr abpmnop,abpmnop
# 0.3312674 0.3187513 1.0000000 1.0000000 1.0000000
It can be seen that, unnormalised, the result is the same for either comparison. However, since the normalised result is equal to (e.g.) sk_u("ab", "xyzabqr") / sqrt(sk_u("ab") * sk_u("xyzabqr")), the reason that sk_n("ab", "xyzabqr") scores higher is that there are two p's in "abpmnop": the repeated character raises that string's self-similarity (22.48 vs 20.81), which lowers the normalised score for the "abpmnop" comparison.
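To confirm, the normalised figures can be reproduced by hand from the unnormalised values above:
sk_u("ab", "xyzabqr") / sqrt(sk_u("ab") * sk_u("xyzabqr"))
#[1] 0.3312674
sk_u("ab", "abpmnop") / sqrt(sk_u("ab") * sk_u("abpmnop"))
#[1] 0.3187513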