Breaking down a Hangul syllable into letters (jamo)

I'm working on a program that deals with Korean sentences and I need a way to break down a syllable, or block, into its letters. For those who don't know Hangul, a syllable is composed of 2-4 letters (jamo), creating thousands of different combinations. What I'd like to do is break down those syllables into the letters that form them.

I was able to get the first letter by comparing its Unicode value to the associated letter in that range, i.e. a syllable that starts with x letter is in y range. However, I'm at a loss for finding the rest of the letters.

This is a table containing the Unicode values for Hangul syllables: http://jrgraphix.net/r/Unicode/AC00-D7AF

Ninjaman494 asked Dec 06 '22 16:12


2 Answers

Hangul syllable decomposition (e.g. 퓛 → ᄑ + ᅱ + ᆶ) is done in Java through the java.text.Normalizer class:

String s = Normalizer.normalize("\uD4DB", Normalizer.Form.NFD);

The algorithm for Hangul decomposition is also given in Section 3.12 of the Unicode Standard (from page 142); and since normalisation also affects other, non-Hangul characters, you should familiarise yourself with the general principles and forms of Unicode normalisation in UAX #15.
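
To see what the call above returns, here is a minimal sketch of a complete program. The class name NfdDemo and the helper toCodePoints are illustrative names, not part of the answer; only Normalizer.normalize and Normalizer.Form.NFD come from the JDK.

```java
import java.text.Normalizer;

// Illustrative sketch: NfdDemo and toCodePoints are made-up names.
final class NfdDemo {
    // Formats each code point of s as U+XXXX, separated by spaces.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // NFD replaces the precomposed syllable with its conjoining jamos.
        String decomposed = Normalizer.normalize("\uD4DB", Normalizer.Form.NFD);
        System.out.println(toCodePoints(decomposed)); // prints U+1111 U+1171 U+11B6
    }
}
```

Note that the result contains conjoining jamos from the U+1100 block, not the compatibility letters in the U+3131 range.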

一二三 answered Dec 20 '22 14:12


Basically, the algorithm for decomposing a Hangul LVT or LV syllable is:

  • subtract 0xAC00 from the scalar value of the code point (between U+AC00 and U+D7A3, not U+D7AF as you state),
  • divide that difference by 28, and:
    • if the remainder is 0, then there's no T jamo;
    • else add 0x11A7 to this remainder (between 1 and 27) to get the trailing T jamo (between U+11A8 and U+11C2);
  • divide the quotient from the previous step by 21, and:
    • add 0x1161 to the second remainder (between 0 and 20) to get the medial V jamo (or final, when there is no T), between U+1161 and U+1175;
    • add 0x1100 to the second quotient (between 0 and 18) to get the leading L jamo (between U+1100 and U+1112).
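
The steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name HangulDecomposer, the constant names and the choice to return other code points unchanged are mine. The leading index runs from 0 to 18, matching the 19 leading consonants U+1100 to U+1112.

```java
// Illustrative sketch of the arithmetic; all names here are made up.
final class HangulDecomposer {
    static final int S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
    static final int V_COUNT = 21, T_COUNT = 28;
    static final int S_COUNT = 19 * V_COUNT * T_COUNT; // 11172 syllables, U+AC00..U+D7A3

    // Returns the L, V and optional T jamos of a precomposed syllable,
    // or the code point unchanged if it lies outside U+AC00..U+D7A3.
    static String decompose(int codePoint) {
        int sIndex = codePoint - S_BASE;
        if (sIndex < 0 || sIndex >= S_COUNT) {
            return new String(Character.toChars(codePoint));
        }
        int tIndex = sIndex % T_COUNT;   // first remainder: 0 means no T jamo
        int lvIndex = sIndex / T_COUNT;  // first quotient
        int vIndex = lvIndex % V_COUNT;  // second remainder: 0..20
        int lIndex = lvIndex / V_COUNT;  // second quotient: 0..18
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(L_BASE + lIndex);
        sb.appendCodePoint(V_BASE + vIndex);
        if (tIndex != 0) {
            sb.appendCodePoint(T_BASE + tIndex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the three jamos of U+D4DB: U+1111 U+1171 U+11B6
        decompose(0xD4DB).codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

For U+D4DB the difference is 10459, which gives remainder 15 and quotient 373 when divided by 28, then remainder 16 and quotient 17 when 373 is divided by 21.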

The other Hangul letters that are decomposable into pairs of simple jamos (those in the range U+1113 to U+11F9, excluding the simple L, V and T jamos in the three ranges returned above, or the extended jamos in the range U+3131 to U+318E) can be processed by a small table lookup (taken from the main UCD table containing canonical decomposition pairs for Hangul).

The algorithm is standardized in Unicode to avoid listing in the UCD table the canonical decompositions of the 11,172 precomposed Hangul characters (the precomposed LV and LVT syllables), whether into a triple (L, V, T simple jamos), which is forbidden in the UCD, into a pair (L, V simple jamos), or into a pair (LV syllable, T simple jamo).

For this reason, the UCD table only lists the first and last precomposed LV or LVT characters that are algorithmically decomposable. They also all share the same character properties (except their "L/V/T/LV/LVT" type, which is listed in an auxiliary table of the UCD specific to Hangul).
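
As a sanity check on the claim that the whole range is covered arithmetically, the following sketch compares the arithmetic result with java.text.Normalizer for every code point from U+AC00 to U+D7A3. The class and method names are illustrative, and the check assumes the JDK's Normalizer follows the Unicode Hangul decomposition.

```java
import java.text.Normalizer;

// Illustrative sketch; HangulRangeCheck, arithmetic and mismatches are made-up names.
final class HangulRangeCheck {
    // Arithmetic decomposition of one syllable in U+AC00..U+D7A3.
    static String arithmetic(int cp) {
        int s = cp - 0xAC00;
        int t = s % 28;
        int v = (s / 28) % 21;
        int l = s / (28 * 21);
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(0x1100 + l).appendCodePoint(0x1161 + v);
        if (t != 0) {
            sb.appendCodePoint(0x11A7 + t);
        }
        return sb.toString();
    }

    // Counts syllables whose arithmetic result differs from NFD.
    static int mismatches() {
        int count = 0;
        for (int cp = 0xAC00; cp <= 0xD7A3; cp++) {
            String syllable = new String(Character.toChars(cp));
            String nfd = Normalizer.normalize(syllable, Normalizer.Form.NFD);
            if (!nfd.equals(arithmetic(cp))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("syllables: " + (0xD7A3 - 0xAC00 + 1)); // 19 * 21 * 28
        System.out.println("mismatches: " + mismatches());
    }
}
```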

Note that some LL or TT precombined consonants are treated as simple jamos and are not decomposable. This follows Hangul tradition for double consonants ("SSANG" jamos), and it is visible in the primary sort order of the jamo alphabet, where the double consonant sorts just after the single consonant, all L or LL jamos sort before all V jamos, and all V jamos sort before T and TT jamos.

Normally, in well-formed Korean syllables, a V jamo (a vowel) can only occur after a leading L jamo (a consonant), and a trailing T jamo (a consonant) can only occur after a V jamo (a vowel).
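
A minimal sketch of that ordering rule, restricted to the three simple jamo ranges given earlier, could look like this. The class and method names are illustrative, and the check deliberately ignores fillers, tone marks and the longer L*, V*, T* sequences.

```java
// Illustrative sketch; JamoOrderCheck and its methods are made-up names.
final class JamoOrderCheck {
    static boolean isL(int cp) { return cp >= 0x1100 && cp <= 0x1112; }
    static boolean isV(int cp) { return cp >= 0x1161 && cp <= 0x1175; }
    static boolean isT(int cp) { return cp >= 0x11A8 && cp <= 0x11C2; }

    // True if the string is exactly one L, one V and an optional T, in that order.
    static boolean isSimpleSyllable(String jamos) {
        int[] cps = jamos.codePoints().toArray();
        if (cps.length < 2 || cps.length > 3) {
            return false;
        }
        if (!isL(cps[0]) || !isV(cps[1])) {
            return false;
        }
        return cps.length == 2 || isT(cps[2]);
    }

    public static void main(String[] args) {
        System.out.println(isSimpleSyllable("\u1111\u1171\u11B6")); // L V T
        System.out.println(isSimpleSyllable("\u1171\u1111"));       // V before L
    }
}
```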

But there are ways to force a well-formed Hangul syllable: a missing L jamo (before an isolated V jamo) can be supplied by inserting a leading Hangul L filler (a control that is not rendered), and a missing V jamo (before an isolated T jamo) can be supplied by inserting a Hangul V filler (also not rendered). This is sometimes used to transliterate non-Korean words starting with a vowel, but Koreans usually use (and render) the consonant IEUNG (which has a leading and a trailing form) for a missing L jamo.
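
The two options for a vowel with no leading consonant can be compared with a short sketch. The code points used are U+115F (HANGUL CHOSEONG FILLER) and U+110B (HANGUL CHOSEONG IEUNG); the class and method names are illustrative. NFC is applied only to show that IEUNG plus a vowel composes into a precomposed syllable while the filler sequence stays as it is.

```java
import java.text.Normalizer;

// Illustrative sketch; MissingLeadDemo and withLead are made-up names.
final class MissingLeadDemo {
    static final int L_FILLER = 0x115F; // HANGUL CHOSEONG FILLER
    static final int IEUNG = 0x110B;    // HANGUL CHOSEONG IEUNG

    // Prefixes a vowel jamo with the given leading code point, then applies NFC.
    static String withLead(int lead, int vowel) {
        String s = new StringBuilder()
                .appendCodePoint(lead)
                .appendCodePoint(vowel)
                .toString();
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // IEUNG + A composes to the single syllable U+C544.
        withLead(IEUNG, 0x1161).codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
        // The filler sequence is left as two code points: U+115F U+1161.
        withLead(L_FILLER, 0x1161).codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```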

Finally, the well-formed (L*, V*, T*) Hangul syllables can be followed by tone marks (dots written to the right of the syllable, which is rendered in a single square). The layout of the (L*, V*, T*) syllabic square is standard in Korean: all L* are aligned horizontally, as are all V* and all T*; the L* block goes in the left part of the square, the V* block to the right, and any T* block below the (L*, V*) block. The tone marks are added separately to the right of the Hangul square containing all jamos of the same syllable.

Korean also has "half-width" variants of the letters, which are not strictly "Hangul". The half-width letters include only L or LV "half-syllables", and there are no T "half-syllables" (they are replaced by L half-syllables). These half-width letters do not render in a Hangul square; they can occur in any order, just like Latin letters. The half-width LV syllables are also decomposable into half-width L and half-width V letters, but as this would widen their rendering, such a decomposition is not canonically equivalent (the half-width LV syllables are treated as unbreakable ligatures, similar to the AE or IJ ligatures in Latin; the decomposition is only used to sort them in a Korean collation, and it is visible in the primary sort order of the Korean alphabet). These half-width letters were encoded for use on old terminals and typewriters, where the character set had to be limited (because it was not possible to render all possible Hangul squares).

verdy_p answered Dec 20 '22 12:12