I am working on a python project where I study RNA structure evolution (represented as a string for example: "(((...)))" where the parenthesis represent basepairs). The point being is that I have an ideal structure and a population that evolves towards the ideal structure. I have implemented everything however I would like to add a feature where I can get the "number of buckets" ie the k most representative structures in the population at each generation.
I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.
thanks!
K-means is a popular clustering algorithm which is widely used in anomaly-based intrusion detection. It tries to classify a given data set into k (a predefined number) categories.
The most important limitations of Simple k-means are: The user has to specify k (the number of clusters) in the beginning. k-means can only handle numerical data. k-means assumes that we deal with spherical clusters and that each cluster has roughly equal numbers of observations.
The k-Means algorithm is not applicable to categorical data, as categorical variables are discrete and do not have any natural origin. So computing euclidean distance for such as space is not meaningful.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.
One problem you would face if using scipy.cluster.vq.kmeans
is that that function uses Euclidean distance to measure closeness. To shoe-horn your problem into one solveable by k-means
clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.
That seems... difficult. Perhaps you are looking for Levenshtein distance instead?
Note there are variants of the K-means algorithm that can work with non-Euclideance distance metrics (such as Levenshtein distance). K-medoids
(aka PAM), for instance, can be applied to data with an arbitrary distance metric.
For example, using Pycluster
's implementation of k-medoids
, and nltk
's implementation of Levenshtein distance,
import nltk.metrics.distance as distance
import Pycluster as PC
words = ['apple', 'Doppler', 'applaud', 'append', 'barker',
'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']
dist = [distance.edit_distance(words[i], words[j])
for i in range(1, len(words))
for j in range(0, i)]
labels, error, nfound = PC.kmedoids(dist, nclusters=3)
cluster = dict()
for word, label in zip(words, labels):
cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():
print(grp)
yields a result like
['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With