Combining Devanagari characters

I have something like

a = "बिक्रम मेरो नाम हो"

I want to achieve something like

a[0] = बि
a[1] = क्र
a[2] = म

But since म takes 4 bytes while बि takes 8 bytes, I can't get there by indexing the string directly. What can I do to achieve this in Python?

meadhikari asked Jul 24 '11 06:07


4 Answers

The algorithm for splitting text into grapheme clusters is given in Unicode Standard Annex #29 (Unicode Text Segmentation), section 3.1. I'm not going to implement the full algorithm for you here, but I'll show you roughly how to handle the case of Devanagari, and then you can read the Annex for yourself and see what else you need to implement.

The unicodedata module contains the information you need to detect the grapheme clusters.

>>> import unicodedata
>>> a = "बिक्रम मेरो नाम हो"
>>> [unicodedata.name(c) for c in a]
['DEVANAGARI LETTER BA', 'DEVANAGARI VOWEL SIGN I', 'DEVANAGARI LETTER KA', 
 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER RA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER MA', 'DEVANAGARI VOWEL SIGN E',
 'DEVANAGARI LETTER RA', 'DEVANAGARI VOWEL SIGN O', 'SPACE',
 'DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN AA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER HA', 'DEVANAGARI VOWEL SIGN O']

In Devanagari, each grapheme cluster consists of an initial letter, optional pairs of virama (vowel killer) and letter, and an optional vowel sign. In regular expression notation that would be LETTER (VIRAMA LETTER)* VOWEL?. You can tell which is which by looking up the Unicode category for each code point:

>>> [unicodedata.category(c) for c in a]
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Lo', 'Zs', 'Lo', 'Mn', 'Lo', 'Mc', 'Zs',
 'Lo', 'Mc', 'Lo', 'Zs', 'Lo', 'Mc']

Letters are category Lo (Letter, Other), vowel signs are category Mc (Mark, Spacing Combining) or Mn (Mark, Nonspacing), virama is also category Mn, and spaces are category Zs (Separator, Space). In every case, the marks that extend a cluster have a category starting with M.

So here's a rough approach to split out the grapheme clusters:

import unicodedata

def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)
    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        # First letter of the general category: 'L' letter, 'M' mark, ...
        cat = unicodedata.category(c)[0]
        # A mark always extends the current cluster; a letter does so
        # only when it directly follows a virama.
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster

>>> list(splitclusters(a))
['बि', 'क्र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
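
The LETTER (VIRAMA LETTER)* VOWEL? pattern described above can also be written directly with the standard re module, and collecting the clusters in a list gives the indexing the question asks for. This is only a minimal sketch, not the Unicode algorithm: the code point ranges and the names LETTER, VOWEL_SIGN, CLUSTER and clusters are assumptions chosen for illustration, and signs such as anusvara, visarga and nukta are deliberately left out.

```python
import re

# Assumed code point ranges for the Devanagari block (a simplification):
LETTER = "[\u0904-\u0939\u0958-\u0961]"   # independent vowels and consonants
VIRAMA = "\u094D"                          # DEVANAGARI SIGN VIRAMA
VOWEL_SIGN = "[\u093E-\u094C]"             # dependent vowel signs

# LETTER (VIRAMA LETTER)* VOWEL?, falling back to any single character.
CLUSTER = re.compile(
    f"{LETTER}(?:{VIRAMA}{LETTER})*{VOWEL_SIGN}?|.", re.DOTALL
)

def clusters(s):
    """Return the clusters of s as a list, so they can be indexed."""
    return CLUSTER.findall(s)

a = "बिक्रम मेरो नाम हो"
parts = clusters(a)
print(parts[0], parts[1], parts[2])   # बि क्र म
```

Because the result is an ordinary list, parts[0], parts[1] and parts[2] behave the way the question wanted a[0], a[1] and a[2] to behave.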

Gareth Rees answered Nov 15 '22 12:11


So, you want to achieve something like this

a[0] = बि a[1] = क्र a[2] = म

My advice is to ditch the idea that string indexing corresponds to the characters you see on the screen. Devanagari, like several other scripts, does not play well with programmers who grew up with Latin characters. I suggest reading chapter 9 of the Unicode Standard.

It looks like what you are trying to do is break a string into grapheme clusters. String indexing by itself will not let you do this. Hangul is another script which plays poorly with string indexing, although with combining characters, even something as familiar as Spanish will cause problems.

You will need an external library such as ICU to achieve this (unless you have lots of free time). ICU has Python bindings (PyICU).

>>> a = u"बिक्रम मेरो नाम हो"
>>> import icu
>>> # Note: This next line took a lot of guesswork.  The C, C++, and Java
>>> # interfaces have better documentation.
>>> b = icu.BreakIterator.createCharacterInstance(icu.Locale())
>>> b.setText(a)
>>> i = 0
>>> for j in b:
...     s = a[i:j]
...     print('|', s, len(s))
...     i = j
... 
| बि 2
| क् 2
| र 1
| म 1
|   1
| मे 2
| रो 2
|   1
| ना 2
| म 1
|   1
| हो 2

Note how some of these "characters" (grapheme clusters) have length 2, and some have length 1. This is why string indexing is problematic: if I want to get grapheme cluster #69450 from a text file, then I have to linearly scan through the entire file and count. So your options are:

  • Build an index (kind of crazy...)
  • Just accept that you can't break on every character boundary. The break iterator object can move both forwards AND backwards, so if you need to extract the first 140 characters of a string, look at index 140 and iterate backwards to the previous grapheme cluster break. That way you don't end up with funny text. (Better yet, use a word break iterator for the appropriate locale.)

The benefit of using this level of abstraction (character iterators and the like) is that it no longer matters which encoding you use: UTF-8, UTF-16, or UTF-32 all just work. Well, mostly work.
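
The "back up to the previous break" idea can be sketched without ICU. The example below does not use a break iterator; it swaps in a simple unicodedata heuristic (combining marks and letters that follow a virama never start a cluster), so it is only an approximation for Devanagari. The names is_cluster_start and truncate are hypothetical helpers, not part of any library.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"

def is_cluster_start(s, i):
    """Heuristic: True if a grapheme cluster may start at index i of s."""
    if i <= 0 or i >= len(s):
        return True
    cat = unicodedata.category(s[i])[0]
    if cat == "M":
        return False          # combining marks attach to the previous character
    if cat == "L" and s[i - 1] == VIRAMA:
        return False          # letter joined to the previous one by a virama
    return True

def truncate(s, limit):
    """Cut s to at most `limit` code points without splitting a cluster."""
    if len(s) <= limit:
        return s
    cut = limit
    # Walk backwards until the cut no longer falls inside a cluster.
    while cut > 0 and not is_cluster_start(s, cut):
        cut -= 1
    return s[:cut]

a = "बिक्रम मेरो नाम हो"
print(truncate(a, 4))   # बि  (index 4 falls inside क्र, so the cut backs up to 2)
```

With a real break iterator the loop would be replaced by a single query for the boundary preceding the offset, and the result would follow the full segmentation rules rather than this heuristic.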

Dietrich Epp answered Nov 15 '22 12:11


You can achieve this with a simple regex in any engine that supports \X, which matches a single extended grapheme cluster.

Unfortunately, Python's re does not support the \X grapheme match.

Fortunately, the third-party regex module, the proposed replacement for re, does support \X:

>>> import regex
>>> a = "बिक्रम मेरो नाम हो"
>>> regex.findall(r'\X', a)
['बि', 'क्', 'र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']

dawg answered Nov 15 '22 11:11


Indic scripts and other non-Latin scripts such as Hangul generally do not map string indices to the characters you see on screen. Working with Indic scripts is generally a pain. Most characters are two bytes, with some rare ones extending into three. With Dravidian scripts, there is no defined order. See the Unicode specification for more details.
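
How many bytes a character occupies depends on the encoding, and a short check makes that concrete. The sketch below encodes म and बि in three Unicode encodings; byte_lengths is a hypothetical helper name. The 4 and 8 bytes mentioned in the question match UTF-32, while the same text takes 2 and 4 bytes in UTF-16 and 3 and 6 bytes in UTF-8.

```python
def byte_lengths(text):
    """Return the encoded size of text, in bytes, for three encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for text in ("म", "बि"):
    # len() counts code points, not bytes and not grapheme clusters.
    print(len(text), byte_lengths(text))

# 1 {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# 2 {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```

None of these counts says where one visible character ends and the next begins, which is why cluster segmentation is needed regardless of encoding.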

That said, check here for some ideas about Unicode and Python with C++.

Finally, as Dietrich said, you might want to check out ICU too. It has bindings available for C/C++ and Java via icu4c and icu4j respectively. There's a learning curve involved, so I suggest you set aside plenty of time for it. :)

S.R.I answered Nov 15 '22 13:11