I have a string of characters, and I'm looking for the arrangement of those characters that is as pronounceable as possible.
For example, if I have the letters "ascrlyo", there are some arrangements that would be more pronounceable than others. The following may get a "high score":
scaroly crasoly
Whereas the following may get a low score:
oascrly yrlcsoa
Is there a simple algorithm I can use? Or better yet, a Python library or function that does this?
Thank you!
Start by solving a simpler problem: is a given word pronounceable?
Supervised machine learning could be effective here. Train a binary classifier on a training set of dictionary words and scrambled words (assuming the scrambled words are all unpronounceable). For features, I suggest counting bigrams and trigrams. My reasoning: unpronounceable trigrams such as 'tns' and 'srh' are rare in dictionary words, even though the individual letters are each common.
The idea is that the trained algorithm will learn to classify words with any rare trigrams as unpronounceable, and words with only common trigrams as pronounceable.
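To make the trigram intuition concrete, here is a minimal sketch using only the standard library. The tiny word list is a made-up stand-in for a real dictionary, and `char_ngrams` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return all contiguous character n-grams of word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Toy stand-in for a dictionary (assumption: a real run would read a
# full word list such as /usr/share/dict/words).
sample_words = ["string", "strong", "strange", "ring", "range", "stranger"]

# Count how often each trigram occurs across the sample words.
trigram_counts = Counter(
    gram for word in sample_words for gram in char_ngrams(word, 3)
)

print(trigram_counts["str"])  # occurs once in each of four sample words
print(trigram_counts["tns"])  # never occurs, so the count is 0
```

A classifier trained on counts like these can treat a word containing 'tns' as suspicious, even though 't', 'n' and 's' are each frequent on their own.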
Here's an implementation with scikit-learn (http://scikit-learn.org/):
import random

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def scramble(s):
    # Return the letters of s in random order.
    return "".join(random.sample(s, len(s)))

# Keep only all-lowercase entries (this skips capitalized words such as proper nouns).
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f if w == w.lower()]
scrambled = [scramble(w) for w in words]

# Dictionary words form one class; their scrambled copies are assumed unpronounceable.
X = words + scrambled
y = ['word'] * len(words) + ['unpronounceable'] * len(scrambled)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Count character 1-, 2- and 3-grams and feed the counts to a naive Bayes classifier.
text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB())
])
text_clf = text_clf.fit(X_train, y_train)

# Evaluate on the held-out split.
predicted = text_clf.predict(X_test)
print(metrics.classification_report(y_test, predicted))
It scores 92% accuracy. Given pronounceability is subjective anyway, this might be as good as it gets.
             precision    recall  f1-score   support

  scrambled       0.93      0.91      0.92     52409
       word       0.92      0.93      0.93     52934

avg / total       0.92      0.92      0.92    105343
It agrees with your examples:
>>> text_clf.predict("scaroly crasoly oascrly yrlcsoa".split())
['word', 'word', 'unpronounceable', 'unpronounceable']
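To get from classifying single words back to the original question, one option is to score every arrangement of the letters and keep the best ones. The sketch below is only an illustration under stated assumptions: `rank_arrangements` and `toy_score` are hypothetical names, and `toy_score` is a deliberately crude placeholder (it rewards vowel/consonant alternation), not the trained classifier.

```python
from itertools import permutations


def rank_arrangements(letters, score, top=5):
    """Return the `top` distinct arrangements of `letters` with the highest score."""
    # A set removes duplicates that arise when a letter is repeated.
    candidates = {"".join(p) for p in permutations(letters)}
    # Sort by descending score; break ties alphabetically so output is stable.
    return sorted(candidates, key=lambda w: (-score(w), w))[:top]


VOWELS = set("aeiou")


def toy_score(word):
    # Placeholder scorer (assumption): count adjacent pairs that switch
    # between vowel and consonant.
    return sum((a in VOWELS) != (b in VOWELS) for a, b in zip(word, word[1:]))


print(rank_arrangements("cat", toy_score, top=2))  # ['cat', 'tac']
```

With the pipeline above, the score could instead come from `text_clf.predict_proba`, reading off the column whose label in `text_clf.classes_` is 'word'. Seven letters give 5040 arrangements, which is cheap to enumerate, but the count grows factorially, so much longer strings would need a smarter search.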
For the curious, here are 10 scrambled words it classifies as pronounceable:
And finally 10 dictionary words misclassified as unpronounceable: