How do I split Tamil characters in a string?
When I use preg_match_all('/./u', $str, $results), I get the characters "த", "ம", "ி", "ழ" and "்".
How do I get the combined characters "த", "மி" and "ழ்"?
I think you should be able to use the grapheme_extract function to iterate over the combined characters (which are technically called "grapheme clusters").
Alternatively, if you prefer the regex approach, I think you can use this:
preg_match_all('/\pL\pM*|./u', $str, $results)
where \pL means a Unicode "letter", and \pM means a Unicode "mark".
(Disclaimer: I have not tested either of these approaches.)
If I understand your question correctly, you've got a Unicode string containing codepoints, and you want to convert it into an array of graphemes?
I'm working on developing an open source Python library to do tasks like this for a Tamil Language website.
I haven't used PHP in a while, so I'll post the logic. You can take a look at the code in the amuthaa/TamilWord.py file's split_letters() function.
As ruakh mentioned, Tamil graphemes are built from one or more codepoints.
The vowels (உயிர் எழுத்து), the aytham (ஆய்த எழுத்து - ஃ) and all the combinations (உயிர்-மெய் எழுத்து) in the 'a' column (அ வரி, i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use a single codepoint.
Every consonant is made up of two codepoints: the a-combination letter + the pulli. E.g. ப் = ப + ்
Every combination other than the a-combinations is also made up of two codepoints: the a-combination letter + a marking, e.g. பி = ப + ி, தை = த + ை
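To see this composition for yourself, here is a small Python 3 sketch that lists each codepoint in a string together with its Unicode name. It uses only the standard unicodedata module; the helper name describe_codepoints is made up for this illustration and is not part of any library mentioned here.

```python
import unicodedata

def describe_codepoints(text):
    """Return a (code, Unicode name) pair for every codepoint in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]

# The word from the question: prints one line per codepoint,
# so letters and their markings show up separately.
for code, name in describe_codepoints("தமிழ்"):
    print(code, name)
```

The markings are reported under names such as "TAMIL VOWEL SIGN ..." and the pulli as "TAMIL SIGN VIRAMA", which makes it easy to check how a given letter is encoded.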
So your logic is going to be something like this:
initialize an empty array
for each codepoint in word:
    if the codepoint is a vowel, a-combination or aytham, it is also its own grapheme, so add it to the array
    otherwise, the codepoint is a marking such as the pulli (i.e. ்) or one of the combination extensions (e.g. ி or ை), so append it to the end of the last element of the array
This of course assumes that your string is well-formed and you don't have things like two markings in a row.
Here's the Python code, in case you find it helpful. If you want to help us port this to PHP, please let me know as well:
    @staticmethod
    def split_letters(word=u''):
        """ Returns the graphemes (i.e. the Tamil characters) in a given word as a list """

        # ensure that the word is a valid word
        TamilWord.validate(word)

        # list (which will be returned to user)
        letters = []

        # a tuple of all combination endings and of all அ combinations
        combination_endings = TamilLetter.get_combination_endings()
        a_combinations = TamilLetter.get_combination_column(u'அ').values()

        # loop through each codepoint in the input string
        for codepoint in word:

            # if codepoint is an அ combination, a vowel, aytham or a space,
            # add it to the list
            if codepoint in a_combinations or \
                    TamilLetter.is_whitespace(codepoint) or \
                    TamilLetter.is_vowel(codepoint) or \
                    TamilLetter.is_aytham(codepoint):

                letters.append(codepoint)

            # if codepoint is a combination ending or a pulli ('்'), add it
            # to the end of the previously-added codepoint
            elif codepoint in combination_endings or \
                    codepoint == TamilLetter.get_pulli():

                # ensure that at least one character already exists
                if len(letters) > 0:
                    letters[-1] = letters[-1] + codepoint

                # otherwise raise an Error. However, validate_word()
                # should catch this
                else:
                    raise ValueError("""%s cannot be first character of a word""" % (codepoint))

        return letters
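If you don't have the library at hand, the same loop can be approximated with only the Python 3 standard library. This is a rough sketch, not the library's implementation: it assumes that every codepoint whose Unicode general category starts with "M" (mark) is a pulli or combination ending, and treats everything else as the start of a new grapheme. The function name split_graphemes is hypothetical, and no validation is done beyond rejecting a leading mark.

```python
import unicodedata

def split_graphemes(word):
    """Group each base codepoint with the combining marks that follow it."""
    letters = []
    for codepoint in word:
        # Assumption: Unicode category M* stands in for the library's
        # tables of combination endings and the pulli.
        is_mark = unicodedata.category(codepoint).startswith("M")
        if is_mark and letters:
            # attach the marking to the previously added letter
            letters[-1] += codepoint
        elif is_mark:
            raise ValueError("%r cannot be the first character of a word" % codepoint)
        else:
            # vowels, a-combinations, aytham, whitespace, anything else
            letters.append(codepoint)
    return letters

print(split_graphemes("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

Like the regex with \pL\pM*, this only groups a letter with its trailing marks, so it makes the same well-formedness assumption as the logic above.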