How does clustering (especially String clustering) work?

1 Answer

To understand what clustering is, imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far apart. Based on this, you can split all the objects into groups (such as cities). Clustering algorithms do exactly this: they let you split your data into groups without specifying the group borders in advance.

All clustering algorithms are based on the distance (or similarity) between two objects. On a geographical map it is the ordinary distance between two houses; in a multidimensional space it may be Euclidean distance (in fact, the distance between two houses on a map is also Euclidean distance). For string comparison you have to use something different. Two good choices here are Hamming and Levenshtein distance. In your particular case Levenshtein distance is preferable, because Hamming distance works only with strings of the same length.
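To make the two string distances concrete, here is a minimal Python sketch. The function names `hamming` and `levenshtein` are just illustrative choices, not from any particular library, and the Levenshtein version is the plain dynamic-programming formulation with unit cost for insertion, deletion and substitution.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` gives 3, while `hamming("kitten", "sitting")` raises an error because the lengths differ.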

Now you can use one of the existing clustering algorithms. There are plenty of them, but not all will fit your needs. For example, pure k-means, already mentioned here, will hardly help you, since it requires the number of groups up front, and with a large dictionary of strings that may be 100, 200, 500, or 10000: you just don't know the number. So other algorithms may be more appropriate.

One of them is the expectation-maximization (EM) algorithm. Its advantage is that some implementations can estimate the number of clusters automatically (for example by cross-validation), although plain EM, like k-means, takes the number of clusters as input. In practice, however, it often gives less precise results than other algorithms, so it is common to use k-means on top of EM: first find the number of clusters and their centers with EM, then use k-means to refine the result.

Another branch of algorithms that may be suitable for your task is hierarchical clustering. The result of cluster analysis in this case is not a set of independent groups, but rather a tree (hierarchy), where several smaller clusters are grouped into a bigger one, and all clusters are finally part of one big cluster. In your case it means that all words are similar to each other to some degree.
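As a rough illustration of the hierarchical idea, the sketch below does bottom-up (agglomerative) single-linkage clustering of words by Levenshtein distance. It is a naive version meant for small inputs only. The name `single_linkage` and the `max_distance` cut-off are assumptions made for this example: instead of building the whole tree, merging simply stops once the two closest clusters are farther apart than `max_distance`, which amounts to cutting the hierarchy at that level.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def single_linkage(words, max_distance):
    """Merge the closest clusters until they are farther than max_distance."""
    clusters = [[w] for w in words]  # start with one cluster per word

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(levenshtein(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        # Find the pair of clusters (i < j) with the smallest linkage.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > max_distance:
            break  # everything left is too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


groups = single_linkage(["cat", "cats", "bat", "dog", "dogs", "house"], 1)
print(groups)
```

Raising `max_distance` merges more words into fewer, looser groups, so you pick the level of the hierarchy after the fact instead of fixing the number of clusters up front.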

122

answered Oct 31 '22 17:10

ffriend


Tags:

string

cluster-analysis

data-mining

Renato Dinhani
