I need to convert a plain-text UTF-8 document from a right-to-left (Arabic) script to a Latin script. Unfortunately, it isn't as easy as a character-by-character transliteration.
For example, the "a" in the R to L language (ا) can be either "a" or "ә" depending on the word composition.
In words with a g, k, e, or hamza (گ،ك،ە، ء), I need to change all the a, o, i, u (ا،و،ى،ۇ) to Latin ә, ѳ, i, ü (called "soft" vowels).
E.g. سالەم becomes sәlêm, ءۇي becomes üy, سوزمەن becomes sѳzmên.
In words without a g, k, e, or hamza (گ،ك،ە، ء), the a, o, i, u change to the Latin characters a, o, i, u (called "hard" vowels).
E.g. الما becomes alma, ۇل becomes ul, ورتا becomes orta.
In essence, the g, k, e, or hamza act as a pronunciation guide in the Arabic script.
In Latin, then, I need two different sets of vowels, depending on the original word in the Arabic script.
I was thinking I might need to handle the "soft" vowel words as step one, then do a separate find-and-replace on the rest of the document. But how do I do a find-and-replace like this in Perl or Python?
Here is a Unicode example: \U+0633\U+0627\U+0644\U+06D5\U+0645 \U+0648\U+0631\U+062A\U+0627 \U+0674\U+06C7\U+064A \U+0633\U+0648\U+0632\U+0645\U+06D5\U+0646 \U+0627\U+0644\U+0645\U+0627 \U+06C7\U+0644 \U+0645\U+06D5\U+0646\U+0649\U+06AD \U+0627\U+062A\U+0649\U+0645 \U+0634\U+0627\U+0644\U+0642\U+0627\U+0631.
It should come out looking like: "sәlêm orta üy sѳzmên alma ul mêning atim xalқar". (Note: the letter ڭ, which is U+06AD, actually ends up as two letters, n+g, to make an "-ng" sound.) It shouldn't look like "salêm orta uy sozmên alma ul mêning atim xalқar", nor like "sәlêm ѳrtә üy sѳzmên әlmә ül mêning әtim xәlқәr".
Many thanks for any help.
Command:
$ echo سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار | ./arabic-to-latin
Output:
sәlêm orta üy sѳzmên alma ul mêning atim xalқar
To use files instead of stdin/stdout:
$ ./arabic-to-latin input_file_with_arabic_text_in_utf8 >output_latin_in_utf8
Where the arabic-to-latin file is:
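#!/usr/bin/perl
use strict;
use warnings;
use utf8;                 # this source file contains UTF-8 literals
use open qw(:std :utf8);  # decode/encode stdin, stdout and files as UTF-8

# XXX normalization (not handled here)

# Transliterate a single word; the vowel set depends on the whole word.
sub replace_word {
    my ($word) = @_;
    local $_ = $word;     # localize $_ so the caller's $_ is not clobbered
    if (/ء|ە|ك|گ/) {      # g, k, e, or hamza in the word
        tr/اوىۇ/әѳiü/;    # soft
    } else {
        tr/اوىۇ/aoiu/;    # hard
    }
    tr/سلەمرتزنشق/slêmrtznxқ/;  # consonants (and e) that map one-to-one
    s/ءüي/üy/g;
    s/ڭ/ng/g;             # one letter becomes two
    return $_;
}

while (my $line = <>) {
    $line =~ s/(\w+)/replace_word($1)/ge;  # rewrite each word in place
    print $line;
}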
To make the arabic-to-latin file executable:
$ chmod +x ./arabic-to-latin
You can build your own translation table that maps ordinals (code points) to replacement characters. You need a separate table for each set of vowels. This is only a partial example, but it should give you an idea of how to do it.
Note that you still need to add the other characters to the translation tables. You can also translate one Arabic character to several Latin ones where needed. If you compare the output to your request, every character that is in the translation tables is converted correctly.
import re

s1 = {u'ء',u'ە',u'ك',u'گ'} # g, k, e, hamza

t1 = {ord(u'ا'):u'ә', # first case: soft vowels
      ord(u'و'):u'ѳ',
      ord(u'ى'):u'i',
      ord(u'ۇ'):u'ü',
      ord(u'ڭ'):u'ng'} # with double

t2 = {ord(u'ا'):u'a', # second case: hard vowels
      ord(u'و'):u'o',
      ord(u'ى'):u'i',
      ord(u'ۇ'):u'u',
      ord(u'ڭ'):u'ng'} # with double

def subst(word):
    # pick the table by checking the whole word for a marker letter
    if any(c in s1 for c in word):
        return word.translate(t1)
    else:
        return word.translate(t2)

s = u'سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار'

print(re.sub(r'(\S+)', lambda m: subst(m.group(1)), s))

# output: سәلەم oرتa ءüي سѳزمەن aلمa uل مەنing aتiم شaلقaر
# requested: sәlêm orta üy sѳzmên alma ul mêning atim xalқar
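For what it's worth, here is a rough Python 3 sketch of how the partial tables could be filled in for the sample sentence. The consonant mappings are taken from the Perl answer's tr/// line. Several things are my own assumptions rather than anything stated in the question: hamza is simply dropped once it has marked the word as soft, ي becomes y (generalized from the single word üy), the high hamza U+0674 from the code point listing is treated like the ordinary hamza, and all names (SOFT_MARKERS, transliterate_word, transliterate) are made up for illustration. No Latin letters were given for ك and گ, so they pass through unchanged. The letters are written as \u escapes to avoid look-alike code points.

```python
import re

# Letters that mark a word as "soft": hamza (U+0621), high hamza (U+0674,
# assumed equivalent), ae (U+06D5), kaf (U+0643), gaf (U+06AF).
SOFT_MARKERS = set("\u0621\u0674\u06d5\u0643\u06af")

# Consonants seen in the sample sentence. Anything not listed here
# (including kaf and gaf) is left untouched by str.translate.
CONSONANTS = {
    "\u0633": "s",   # seen
    "\u0644": "l",   # lam
    "\u06d5": "ê",   # ae
    "\u0645": "m",   # meem
    "\u0631": "r",   # reh
    "\u062a": "t",   # teh
    "\u0632": "z",   # zain
    "\u0646": "n",   # noon
    "\u0634": "x",   # sheen
    "\u0642": "қ",   # qaf
    "\u06ad": "ng",  # ng: one letter becomes two
    "\u064a": "y",   # yeh (assumption, based on the word "üy")
    "\u0621": None,  # hamza: dropped (assumption)
    "\u0674": None,  # high hamza: dropped (assumption)
}

# alef, waw, alef maksura, u
HARD_VOWELS = {"\u0627": "a", "\u0648": "o", "\u0649": "i", "\u06c7": "u"}
SOFT_VOWELS = {"\u0627": "ә", "\u0648": "ѳ", "\u0649": "i", "\u06c7": "ü"}

HARD_TABLE = str.maketrans({**CONSONANTS, **HARD_VOWELS})
SOFT_TABLE = str.maketrans({**CONSONANTS, **SOFT_VOWELS})


def transliterate_word(word):
    """Pick the vowel table from the whole word, then translate it."""
    soft = any(ch in SOFT_MARKERS for ch in word)
    return word.translate(SOFT_TABLE if soft else HARD_TABLE)


def transliterate(text):
    """Transliterate every whitespace-separated word, keeping the spacing."""
    return re.sub(r"\S+", lambda m: transliterate_word(m.group(0)), text)
```

Calling transliterate() on the sample sentence from the question gives the requested line. Because the soft/hard decision is made per word before any replacement happens, there is no need for two separate find-and-replace passes over the document.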