Text clustering with Levenshtein distances

I have a set of 2k to 4k short strings (3 to 6 characters each) that I want to cluster. Since I am working with strings, previous answers to "How does clustering (especially String clustering) work?" informed me that Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go rather than k-means.

Although I understand the problem in its abstract form, I do not know the easiest way to actually do it. For example, is MATLAB or R the better choice for implementing hierarchical clustering with a custom distance function (Levenshtein distance)? For both, one can easily find a Levenshtein distance implementation. The clustering part seems harder. For example, "Clustering text in MATLAB" calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clustering. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom function?

asked Feb 02 '14 14:02

Alexandros

2 Answers

This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.

    set.seed(1)
    rstr <- function(n,k){   # vector of n random char(k) strings
      sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=TRUE)))})
    }

    str <- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
    # Levenshtein Distance
    d <- adist(str)
    rownames(d) <- str
    hc <- hclust(as.dist(d))
    plot(hc)
    rect.hclust(hc,k=3)
    df <- data.frame(str,cutree(hc,k=3))

In this example, we artificially create a set of 30 random five-character strings in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...) and run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and pair each original string with its cluster ID in a data frame.
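Since the question says the number of clusters is not known in advance, one option is to cut the dendrogram at a height rather than at a fixed k. The following is only a minimal sketch: the toy strings and the threshold `max_dist` are assumptions chosen for illustration, and a sensible threshold for real data would have to be picked by inspecting the dendrogram.

```r
# Toy input (illustrative, not from the example above): three groups of
# five-letter strings; the groups share no letters with each other.
strs <- c("abcde", "abcdf", "abcdg",
          "vwxyz", "vwxyk", "vwxyj",
          "mnopq", "mnopr", "mnops")

d <- adist(strs)            # pairwise Levenshtein distances (base R, utils)
rownames(d) <- strs
hc <- hclust(as.dist(d))    # complete linkage is the default method

# Cut by height instead of by cluster count: strings stay together only
# if their cluster was merged at a height <= max_dist (assumed threshold).
max_dist <- 2
clusters <- cutree(hc, h = max_dist)

# List the strings that fall into each cluster.
split(strs, clusters)
```

With complete linkage, the merge height of a cluster is the largest pairwise distance inside it, so `max_dist` can be read as "no two strings in a cluster differ by more than this many edits".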

answered Sep 27 '22 23:09

jlhoward

ELKI includes Levenshtein distance, and offers a wide choice of advanced clustering algorithms, for example OPTICS clustering.

Text clustering support was contributed by Felix Stahlberg, as part of his work on:

Stahlberg, F., Schlippe, T., Vogel, S., & Schultz, T.
Word segmentation through cross-lingual word-to-phoneme alignment.
Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.

We would of course appreciate additional contributions.

answered Sep 27 '22 22:09

Erich Schubert

Tags:

r

matlab

cluster-analysis

levenshtein-distance

hierarchical-clustering
