Can I use K-means algorithm on a string?

Tags:

I am working on a python project where I study RNA structure evolution (represented as a string for example: "(((...)))" where the parenthesis represent basepairs). The point being is that I have an ideal structure and a population that evolves towards the ideal structure. I have implemented everything however I would like to add a feature where I can get the "number of buckets" ie the k most representative structures in the population at each generation.

I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.

thanks!

907

asked Jun 09 '11 13:06

Doni

1 Answers

One problem you would face if using scipy.cluster.vq.kmeans is that that function uses Euclidean distance to measure closeness. To shoe-horn your problem into one solveable by k-means clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.

That seems... difficult. Perhaps you are looking for Levenshtein distance instead?

Note there are variants of the K-means algorithm that can work with non-Euclideance distance metrics (such as Levenshtein distance). K-medoids (aka PAM), for instance, can be applied to data with an arbitrary distance metric.

For example, using Pycluster's implementation of k-medoids, and nltk's implementation of Levenshtein distance,

import nltk.metrics.distance as distance
import Pycluster as PC

words = ['apple', 'Doppler', 'applaud', 'append', 'barker', 
         'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']

dist = [distance.edit_distance(words[i], words[j]) 
        for i in range(1, len(words))
        for j in range(0, i)]

labels, error, nfound = PC.kmedoids(dist, nclusters=3)
cluster = dict()
for word, label in zip(words, labels):
    cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():
    print(grp)

yields a result like

['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']

116

answered Oct 15 '22 19:10

unutbu

Related questions
                            
                                Python Pandas prevent line break in cell
                            
                                How to get randomly select n elements from a list using in numpy?
                            
                                pandas merge on columns with different names and avoid duplicates [duplicate]
                            
                                Dictionary of Pandas Dataframes to MultiIndex Dataframe
                            
                                python - how to append numpy array to a pandas dataframe
                            
                                Reconstructing an image after using extract_image_patches
                            
                                How can OrderedDict know about the element order of an already instantiated dict?
                            
                                PIL how to scale image in relation to the text drawn on image
                            
                                How to limit field access on a model based on user type on Graphene/Django?
                            
                                Using Python decorators to retry request
                            
                                Repeating x axis labels for all facets using FacetGrid in seaborn
                            
                                AttributeError: 'str' object has no attribute 'view' in Seaborn , Scatterplot
                            
                                How to resolve "ValueError: Empty module name"?
                            
                                Create static instances of a class inside said class in Python
                            
                                How do I save data from a ModelForm to database in django?
                            
                                Does Django have BDD testing tools comparable to Rails' testing tools?
                            
                                Python "dir" equivalent in Clojure
                            
                                key corresponding to maximum value in python dictionary
                            
                                how to get the Seconds of the time using python
                            
                                Converting ConfigParser values to python data types

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Can I use K-means algorithm on a string?

Tags:

python

algorithm

cluster-analysis

k-means

bioinformatics

Doni

People also ask

1 Answers

unutbu

Recent Activity

Donate For Us