I'm looking for a fully accurate statement of an algorithm to count syllables in words. What I find when I research is either inconsistent or something I know generates incorrect results. Does anyone have any suggestions on how to accomplish this? Thanks.
The algorithm I'm using now:
Are there any more rules I'm missing? When my tests turn up incorrect results, I'm trying to determine whether the algorithm I'm using is wrong or my implementation of it is.
Ambiguity is a huge issue in natural language processing, but some tasks can actually cope with the ambiguity with good accuracy. It turns out syllabification is one of them, so don't listen to the other answers. :)
You could come up with algorithms achieving correct syllabification virtually throughout the English vocabulary, but it seems complicated to program correctly.
As always, when hand-made algorithms don't help much, Natural Language Processing researchers use hand-tagged corpora containing the correct answers for given words. Learning algorithms are then trained on these corpora and often provide great accuracy. You can use LingPipe's syllabification (see "English syllabification"), which follows this approach.
English only has so many words, which is how we came up with dictionaries. Such dictionaries often contain the correct syllabification. You could scrape reference.com. For example, the undulate entry contains « un·du·late », which is enough to know there are three syllables.
Other such dictionaries include Answers.com, The Free Dictionary, Merriam-Webster, and so on. Do read the Terms and Conditions, since automated retrieval may not be allowed. Also note that different dictionaries don't always agree with each other.
It won't help with new words or proper nouns, but I'd say it's going to be the most accurate method.
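As a rough sketch of the dictionary approach, counting syllables comes down to counting the segments of a syllabified headword such as « un·du·late ». The lookup table and the function names below are hypothetical stand-ins for illustration, not a real dictionary API; a real version would load entries from a source whose terms permit it.

```python
from typing import Optional

# Hypothetical hand-entered table standing in for a real dictionary source.
SYLLABIFIED = {
    "undulate": "un·du·late",
}


def count_segments(headword: str, separator: str = "·") -> int:
    """Count the syllable segments in a headword like 'un·du·late'."""
    # Ignore empty pieces in case of stray or doubled separators.
    return len([part for part in headword.strip().split(separator) if part])


def lookup_syllables(word: str) -> Optional[int]:
    """Return the syllable count for a known word, or None if it is missing."""
    entry = SYLLABIFIED.get(word.lower())
    return None if entry is None else count_segments(entry)


print(lookup_syllables("undulate"))  # 3
print(lookup_syllables("blorptang"))  # None (not in the table)
```

Returning None for unknown words makes the limitation explicit: new words and proper nouns need a fallback of some kind.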
Another related problem got a lot more exposure: hyphenation. But don't use that! It is used in typesetting programs such as LaTeX, but only aims to provide some of the correct hyphens, without ever providing an incorrect one (high precision, low recall). It's interesting to note that there are only 14 exceptions, e.g. project, which has a different hyphenation depending on the part of speech (verb or noun).
If you decide that it's enough for your needs, note that a few implementations of the TeX hyphenation algorithm exist in other languages, such as Python, Perl or Ruby.
I'm looking for a fully accurate statement of an algorithm to count syllables in words
There isn't one. Period. Whatever algorithm you invent, I promise to find a counterexample. In certain languages (Armenian and Russian come to mind) the algorithm is pretty straightforward: count the number of vowels. In other languages, such as German, it's not as straightforward but still doable. In English, I am afraid, the mapping between letters and sounds is absolutely irregular.
For example, take coincidence: the oi has to be counted as two syllables, but in boil it's only one. Also, not counting the final vowel is not always accurate. Consider the names Penelope or Hermione. Or banana.
Another curious case is when a syllable exists without a printed vowel. For example, table is a two-syllable word, but the second syllable is generated by the invisible sound between b and l. Also, don't forget about words of Greek origin, which can have a lot of consecutive vowels, e.g. onomatopoeia.
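To make these counterexamples concrete, here is a minimal sketch of one common rule-based heuristic: count runs of vowel letters and drop a trailing e. This is an assumed baseline chosen for illustration, not the asker's algorithm, and the reference counts in the table are entered by hand.

```python
import re


def naive_syllable_count(word: str) -> int:
    """Count runs of vowel letters, treating a trailing 'e' as silent."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    # Assumed rule: a final 'e' is silent unless it is the only vowel run.
    if w.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


# (word, hand-entered reference count) pairs taken from the examples above.
EXAMPLES = [
    ("coincidence", 4),
    ("boil", 1),
    ("Penelope", 4),
    ("Hermione", 4),
    ("banana", 3),
    ("table", 2),
    ("onomatopoeia", 6),
]

for word, expected in EXAMPLES:
    got = naive_syllable_count(word)
    status = "ok" if got == expected else "WRONG"
    print(f"{word:13} heuristic={got} reference={expected} {status}")
```

Under these two rules only boil and banana come out right. Each added rule that repairs one of the other words tends to break something else, which is the point being made here.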
So, there is no accurate algorithm. The only way you can go is to try to find an algorithm which works in many (I am avoiding the word most) cases. But in this case you should redefine your requirements.