
How to split Tamil characters in a string in PHP

How do I split Tamil characters in a string?

When I use preg_match_all('/./u', $str, $results),
I get the characters "த", "ம", "ி", "ழ" and "்".

How do I get the combined characters "த", "மி" and "ழ்"?

priyacst asked Jan 10 '12 03:01


2 Answers

I think you should be able to use the grapheme_extract function to iterate over the combined characters (which are technically called "grapheme clusters").

Alternatively, if you prefer the regex approach, I think you can use this:

preg_match_all('/\pL\pM*|./u', $str, $results)

where \pL means a Unicode "letter", and \pM means a Unicode "mark".

(Disclaimer: I have not tested either of these approaches.)
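For comparison, here is a rough Python sketch of the same letter-plus-marks idea (Python being the language used in the other answer). Python's built-in re module does not support \pL or \pM, so the sketch checks Unicode general categories with the standard unicodedata module instead. The function name split_clusters is made up for illustration, and this only approximates the regex above; it is not a full grapheme-cluster implementation.

```python
import unicodedata

def split_clusters(text):
    # Roughly mirrors the pattern \pL\pM*|. : a letter followed by any
    # number of marks forms one chunk; every other code point stands alone.
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # Only attach a mark when the current chunk starts with a letter.
        prev_starts_with_letter = bool(clusters) and \
            unicodedata.category(clusters[-1][0]).startswith("L")
        if is_mark and prev_starts_with_letter:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(split_clusters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

Both ி and ் fall in the "mark" categories, which is why they attach to the preceding letter.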

ruakh answered Oct 09 '22 22:10


If I understand your question correctly, you've got a Unicode string containing codepoints, and you want to convert it into an array of graphemes?

I'm working on developing an open source Python library to do tasks like this for a Tamil Language website.

I haven't used PHP in a while, so I'll post the logic. You can take a look at the code in the amuthaa/TamilWord.py file's split_letters() function.

As ruakh mentioned, Tamil graphemes are made up of one or more codepoints.

  • The vowels (உயிர் எழுத்து), the aytham (ஆய்த எழுத்து - ஃ) and all the combinations (உயிர்-மெய் எழுத்து) in the 'a' column (அ வரி - i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use a single codepoint.

  • Every consonant is made up of two codepoints: the a-combination letter + the pulli. E.g. ப் = ப + ்

  • Every combination other than the a-combinations is also made up of two codepoints: the a-combination letter + a marking, e.g. பி = ப + ி, தை = த + ை
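To see these compositions directly, here is a short Python sketch that lists the codepoints behind a letter using the standard unicodedata module. The helper name describe is made up for this example, and the letters are written as escapes so the codepoints are unambiguous.

```python
import unicodedata

def describe(letter):
    """Return (code point, Unicode name) pairs for each code point in letter."""
    return [("U+%04X" % ord(cp), unicodedata.name(cp)) for cp in letter]

# U+0BAA is ப, U+0BCD is the pulli and U+0BBF is the vowel sign ி.
for letter in ("\u0baa", "\u0baa\u0bcd", "\u0baa\u0bbf"):
    print(letter, describe(letter))
```

Note that Unicode names the pulli "TAMIL SIGN VIRAMA".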

So your logic is going to be something like this:

initialize an empty array

for each codepoint in word:

    if the codepoint is a vowel, a-combination or aytham, it is also its grapheme, so add it to the array

    otherwise, the codepoint is a marking such as the pulli (i.e. ்) or one of the combination extensions (e.g.  ி or  ை), so append it to the end of the last element of the array

This of course assumes that your string is well-formed and you don't have things like two markings in a row.

If you want to help us port this to PHP, please let me know. Here's the Python code, in case you find it helpful:

@staticmethod
def split_letters(word=u''):
    """ Returns the graphemes (i.e. the Tamil characters) in a given word as a list """

    # ensure that the word is a valid word
    TamilWord.validate(word)

    # list (which will be returned to user)
    letters = []

    # a tuple of all combination endings and of all அ combinations
    combination_endings = TamilLetter.get_combination_endings()
    a_combinations = TamilLetter.get_combination_column(u'அ').values()

    # loop through each codepoint in the input string
    for codepoint in word:

        # if codepoint is an அ combination, a vowel, aytham or a space,
        # add it to the list
        if codepoint in a_combinations or \
            TamilLetter.is_whitespace(codepoint) or \
            TamilLetter.is_vowel(codepoint) or \
            TamilLetter.is_aytham(codepoint):

            letters.append(codepoint)

        # if codepoint is a combination ending or a pulli ('்'), add it
        # to the end of the previously-added codepoint
        elif codepoint in combination_endings or \
            codepoint == TamilLetter.get_pulli():

            # ensure that at least one character already exists
            if len(letters) > 0:
                letters[-1] = letters[-1] + codepoint

            # otherwise raise an error. However, TamilWord.validate()
            # should already have caught this
            else:
                raise ValueError("%s cannot be the first character of a word" % codepoint)

    return letters
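
The function above depends on the TamilWord and TamilLetter helpers, which aren't shown here. As a rough self-contained sketch of the same logic (not the library's actual code), the markings can be hardcoded from the Tamil Unicode block. The names PULLI, VOWEL_SIGNS, MARKINGS and split_letters_standalone are made up for this example, and unlike the original it does no validation beyond rejecting a leading marking.

```python
# Assumption: the markings are the pulli plus the vowel signs
# (combination endings) U+0BBE to U+0BCC from the Tamil Unicode block.
PULLI = "\u0bcd"
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"
MARKINGS = set(VOWEL_SIGNS + PULLI)

def split_letters_standalone(word):
    """Return the Tamil letters in word as a list of strings."""
    letters = []
    for codepoint in word:
        if codepoint in MARKINGS:
            # a marking attaches to the previously added letter
            if not letters:
                raise ValueError("%r cannot be the first character of a word" % codepoint)
            letters[-1] += codepoint
        else:
            # vowels, aytham, a-combinations and anything else stand alone
            letters.append(codepoint)
    return letters

print(split_letters_standalone("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

Because anything that is not a marking starts a new element, non-Tamil characters pass through as single-codepoint entries rather than being dropped.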
Ashwin Balamohan answered Oct 09 '22 22:10