I am trying to create a distance matrix (to use for clustering) for strings, based on a customized distance function. I ran the code on a list of 6000 words and it has been running for the last 90 minutes. I have 8 GB RAM and an Intel i5, so the problem must be with the code itself. Here is my code:
library(stringdist)

# Calculate distance between two monograms/bigrams
stringdist2 <- function(word1, word2) {
  # for bigrams - phrases with two words
  if (grepl(" ", word1)) {
    # "Hello World" and "World Hello" are not so different for me
    d = min(stringdist(word1, word2),
            stringdist(word1, gsub(word2,
                                   pattern = "(.*) (.*)",
                                   repl = "\\2,\\1")))
  }
  # for monograms (words)
  else {
    # add penalty of 5 points if first character is not the same
    # brave and crave are more different than brave and bravery
    d = ifelse(substr(word1, 1, 1) == substr(word2, 1, 1),
               stringdist(word1, word2),
               stringdist(word1, word2) + 5)
  }
  d
}
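To show what I intend the function to return, a few spot checks (assuming stringdist's default OSA metric):

stringdist2("brave", "crave")               # 1 + 5 penalty = 6
stringdist2("brave", "bravery")             # 2, first letters match
stringdist2("Hello World", "World Hello")   # 1 via the swapped comparison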
# create distance matrix
stringdistmat2 = function(arr) {
  mat = matrix(nrow = length(arr), ncol = length(arr))
  for (k in 1:(length(arr) - 1)) {
    for (j in k:(length(arr) - 1)) {
      mat[j + 1, k] = stringdist2(arr[k], arr[j + 1])
    }
  }
  as.dist(mat)
}
test = c("Hello World","World Hello", "Hello Word", "Cello Word")
mydmat = stringdistmat2(test)
> mydmat
  1 2 3
2 1
3 1 2
4 2 3 1
I think the issue could be that I used loops instead of apply, but then I found in many places that loops are not that inefficient. More importantly, I am not skilled enough to use apply for my loops, since they are nested loops of the form k in 1:n and j in k:n. I wonder if there are other things which can be optimized as well.
Interesting question. So going step by step:
1 - The stringdist function is already vectorized:
#> stringdist("byye", c('bzyte','byte'))
#[1] 2 1
#> stringdist(c('doggy','gadgy'), 'dodgy')
#[1] 1 2
But given two vectors of the same length, stringdist loops over the two vectors in parallel, element by element (not producing a matrix of cross results), just as Map would do:
#> stringdist(c("byye","alllla"), c('bzyte','byte'))
#[1] 2 6
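As an illustrative sketch (my own snippet, not from the stringdist package), the same element-wise pairing written out with mapply gives the same result:

#> mapply(stringdist, c("byye","alllla"), c('bzyte','byte'), USE.NAMES = FALSE)
#[1] 2 6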
2 - Rewrite your function so that the new version keeps this vectorized behaviour:
stringdistFast <- function(word1, word2)
{
    d1 = stringdist(word1, word2)
    d2 = stringdist(word1, gsub("(.+) (.+)", "\\2 \\1", word2))
    # equal distances mean no swap helped: apply the first-letter penalty;
    # otherwise keep the smaller of the two distances
    ifelse(d1 == d2,
           d1 + 5 * (substr(word1, 1, 1) != substr(word2, 1, 1)),
           pmin(d1, d2))
}
It is indeed working the same way:
#> stringdistFast("byye", c('bzyte','byte'))
#[1] 2 1
#> stringdistFast("by ye", c('bzyte','byte','ye by'))
#[1] 3 2 0
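And a quick check of my own (values assume the OSA default) that the first-letter penalty still applies to single words:

#> stringdistFast("brave", c('crave','bravery'))
#[1] 6 2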
3 - Rewrite the distance-matrix function with only one loop, and only over a triangular part (no outer there, it's slow!):
stringdistmatFast <- function(test)
{
    m = diag(0, length(test))
    sapply(1:(length(test) - 1), function(i)
    {
        m[, i] <<- c(rep(0, i), stringdistFast(test[i], test[(i + 1):length(test)]))
    })
    `dimnames<-`(m + t(m), list(test, test))
}
4 - Use the function:
#> stringdistmatFast(test)
#            Hello World World Hello Hello Word Cello Word
#Hello World           0           0          1          2
#World Hello           0           0          1          2
#Hello Word            1           1          0          1
#Cello Word            2           2          1          0
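For a rough comparison against the original nested-loop version (timings vary by machine, so take this as a sketch rather than exact numbers):

arr = sample(test, size = 300, replace = TRUE)
system.time(stringdistmat2(arr))     # original double loop
system.time(stringdistmatFast(arr))  # single loop + vectorized distances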
Loops in R are indeed very inefficient, and here is a quick example that shows it:
x = rnorm(1000000)
# vectorised: one call into an internal C loop
system.time({y1 = sum(x)})
# R-level loop: one `+` call per element
system.time({
  y2 = 0
  for (i in 1:length(x)) {
    y2 = y2 + x[i]
  }
})
This is a simple comparison of the internal vectorised function sum(), which essentially calculates the sum of all elements in a single internal cycle; the second snippet does the same in R code, which makes R call the internal function + over and over, which is far less efficient.
First of all, you have a couple of mistakes/inconsistencies in your user-defined function.
This part:
gsub(word2, pattern = "(.*) (.*)", repl = "\\2,\\1")
swaps the two words but joins them with a comma instead of a space, which automatically adds +1 to the distance score (was that intended?)
Second of all, you don't compare first letters for strings that contain spaces, because then only the first branch of the function is executed. That is true even if only the first of the two compared words contains a space, so comparing "Hello " with "Cello" yields a smaller distance than comparing "Hello" with "Cello".
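To see it concretely (values assume the default OSA metric):

stringdist2("Hello ", "Cello")  # bigram branch taken: 2, no first-letter penalty
stringdist2("Hello", "Cello")   # monogram branch: 1 + 5 = 6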
Other than that, your code seems easy to vectorise, because all the functions you use are already vectorised: stringdist(), grepl(), gsub(), substr(), etc. Basically you perform three calculations for each word pair: a plain stringdist(), a stringdist() of the swapped words (if there is a space in the first word), and a simple comparison of first letters that adds +5 points if they differ.
Here is code that reproduces your function in a vectorised manner, which gives around a 50x speed-up when calculating a 300x300 matrix:
stringdist3 <- function(words1, words2) {
  # plain distance
  m1 <- stringdist(words1, words2)
  # distance with the two words of words2 swapped (comma kept, as in the original)
  m2 <- stringdist(words1, gsub(words2,
                                pattern = "(.*) (.*)",
                                repl = "\\2,\\1"))
  # element-wise minimum of the two
  m  <- mapply(function(x, y) min(x, y), m1, m2)
  # +5 penalty when first letters differ and the first word has no space
  m3 <- 5 * (substr(words1, 1, 1) != substr(words2, 1, 1) & !grepl(" ", words1))
  m3 + m
}

stringdistmat3 = function(arr) {
  outer(arr, arr, function(x, y) stringdist3(x, y))
}
test = c("Hello World","World Hello", "Hello Word", "Cello Word")
arr=sample(test,size=300,replace=TRUE)
system.time({mat = stringdistmat2(arr)})
system.time({
mat2=stringdistmat3(arr)
})
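Since the original goal was clustering, the resulting matrix can be handed to hclust after converting it to a dist object; a minimal sketch (the "average" linkage is just an example choice):

d  = as.dist(stringdistmat3(test))
hc = hclust(d, method = "average")
plot(hc)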