
Using Perl or Python to replace the Arabic character "ا" with "a" in one word, but "ә" in a different word

I need to convert a plain-text UTF-8 document from a right-to-left language to a Latin script. Unfortunately, it isn't as simple as a character-for-character transliteration.
For example, the "a" in the right-to-left script (ا) can become either "a" or "ә", depending on the other letters in the word.

In words with a g, k, e, or hamza (گ،ك،ە، ء)
I need to change all the a, o, i, u (ا،و،ى،ۇ) to Latin ә, ѳ, i, ü (called "soft" vowels).
eg. سالەم becomes sәlêm, ءۇي becomes üy, سوزمەن becomes sѳzmên

In words without a g, k, e, or hamza (گ،ك،ە، ء)
the a, o, i, u change to Latin characters, a, o, i, u (called "hard" vowels).
eg. الما becomes alma, ۇل becomes ul, ورتا becomes orta.

In essence,
the g, k, e, or hamza acts as a pronunciation guide in the Arabic script.
In Latin, I then need two different sets of vowels, depending on the original word in the Arabic script.

I was thinking I might need to handle the "soft" vowel words as step one, then run a separate find-and-replace on the rest of the document. But how do I do a find-and-replace like this with Perl or Python in the first place?

Here is a unicode example: \U+0633\U+0627\U+0644\U+06D5\U+0645 \U+0648\U+0631\U+062A\U+0627 \U+0674\U+06C7\U+064A \U+0633\U+0648\U+0632\U+0645\U+06D5\U+0645 \U+0627\U+0644\U+0645\U+0627 \U+06C7\U+0644 \U+0645\U+06D5\U+0646\U+0649\U+06AD \U+0627\U+062A\U+0649\U+0645 \U+0634\U+0627\U+0644\U+0642\U+0627\U+0631.

It should come out looking like: "sәlêm orta üy sѳzmên alma ul mêning atim xalқar". (Note: the letter ڭ, which is U+06AD, actually ends up as two letters, n+g, to make an "-ng" sound.) It shouldn't look like "salêm orta uy sozmên alma ul mêning atim xalқar", nor "sәlêm ѳrtә üy sѳzmên әlmә ül mêning әtim xәlқәr".
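
For testing, the escaped listing above can be turned back into real text. The following is a minimal sketch, assuming the `\U+XXXX` notation used in the example (it is not a standard Python escape). The helper name `decode_codepoints` is hypothetical.

```python
import re

def decode_codepoints(listing):
    # Each "\U+XXXX" token is read as a hexadecimal Unicode code point and
    # replaced by that character; spaces and punctuation are left unchanged.
    return re.sub(
        r"\\U\+([0-9A-Fa-f]{4,6})",
        lambda m: chr(int(m.group(1), 16)),
        listing,
    )

# Four code points from the listing: the word that should become "alma".
word = decode_codepoints(r"\U+0627\U+0644\U+0645\U+0627")
print(len(word))  # 4
```

The decoded string can then be fed to whichever transliteration routine is being tried.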

Many thanks for any help.

asked Jan 30 '13 10:01

Shane


2 Answers

Command:

$ echo سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار | ./arabic-to-latin

Output:

sәlêm orta üy sѳzmên alma ul mêning atim xalқar

To use files instead of stdin/stdout:

$ ./arabic-to-latin input_file_with_arabic_text_in_utf8 >output_latin_in_utf8

Where the arabic-to-latin file is:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
# XXX: the input is not Unicode-normalized before matching

sub replace_word {
    my ($word) = @_;
    local $_ = $word; # localize $_ so the caller's value is not clobbered
    if (/ء|ە|ك|گ/) { # g, k, e, or hamza in the word
        tr/اوىۇ/әѳiü/; # soft
    } else {
        tr/اوىۇ/aoiu/; # hard
    }
    tr/سلەمرتزنشق/slêmrtznxқ/;
    s/ءüي/üy/g;
    s/ڭ/ng/g;
    $_;
}

while (my $line = <>) {
    $line =~ s/(\w+)/replace_word($1)/ge;
    print $line;
}

To make the arabic-to-latin file executable:

$ chmod +x ./arabic-to-latin
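
The script leaves its normalization note open. As a sketch of what such a step could look like (this is an assumption about the intent, and it is shown in Python rather than Perl), the input can be put into NFC form before any per-character matching, so that a letter written as a base character plus a combining mark is seen as one code point. The function name `normalize_text` is hypothetical.

```python
import unicodedata

def normalize_text(text):
    # NFC composes a base letter followed by combining marks into a single
    # precomposed code point wherever Unicode defines one.
    return unicodedata.normalize("NFC", text)

# "e" + COMBINING CIRCUMFLEX ACCENT (U+0302) becomes the single letter U+00EA.
print(normalize_text("e\u0302") == "\u00ea")  # True
```

Character tables like the ones used here match exact code points only, which is why normalizing first can matter for input coming from different editors or keyboards.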
answered Nov 15 '22 15:11

jfs


You can build your own translation tables that map character ordinals to their replacements. Because the vowels map differently in "soft" and "hard" words, you need a separate table for each case. This is only a partial example, but it should give you an idea of how to do it.


Note that you would still need to add the remaining characters to the translation tables. A single Arabic character can also map to several Latin ones where needed, as the "ng" entry shows. Comparing the output with your requested result, every character that is in the translation tables comes out correctly.

import re

s1 = {u'ء',u'ە',u'ك',u'گ'} # g, k, e, hamza

t1 = {ord(u'ا'):u'ә',  # first case
      ord(u'و'):u'ѳ',
      ord(u'ى'):u'i',
      ord(u'ۇ'):u'ü',
      ord(u'ڭ'):u'ng'} # with double

t2 = {ord(u'ا'):u'a',  # second case
      ord(u'و'):u'o',
      ord(u'ى'):u'i',
      ord(u'ۇ'):u'u',
      ord(u'ڭ'):u'ng'} # with double

def subst(word):    
    if any(c in s1 for c in word):
        return word.translate(t1)
    else:
        return word.translate(t2)

s = u'سالەم ورتا ءۇي سوزمەن الما ۇل مەنىڭ اتىم شالقار'

# Python 3 syntax: print() is a function, and the ur'' prefix no longer exists
print(re.sub(r'(\S+)', lambda m: subst(m.group(1)), s))

# output:    سәلەم oرتa ءüي سѳزمەن aلمa uل مەنing aتiم شaلقaر

# requested: sәlêm orta üy sѳzmên alma ul mêning atim xalқar
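
The partial tables above can be filled in along the following lines. This is a sketch for Python 3, not a complete transliteration scheme: the consonant values are taken from the Perl answer and the question's sample, while dropping the hamza and mapping U+064A to "y" are assumptions inferred from the single "üy" example. The names `COMMON`, `SOFT`, `HARD`, `transliterate_word` and `transliterate` are hypothetical. Code points are written as numbers to avoid confusing look-alike letters.

```python
import re

# Letters that mark a word as "soft": g, k, e, hamza (from the question).
SOFT_MARKERS = {"\u06af", "\u0643", "\u06d5", "\u0621"}

# Letters whose Latin form does not depend on the word.
# Assumption: hamza (U+0621) is dropped and U+064A becomes "y".
COMMON = {
    0x0633: "s", 0x0644: "l", 0x06D5: "ê", 0x0645: "m", 0x0631: "r",
    0x062A: "t", 0x0632: "z", 0x0646: "n", 0x0634: "x", 0x0642: "қ",
    0x06AD: "ng", 0x06AF: "g", 0x0643: "k", 0x064A: "y", 0x0621: None,
}

# The four vowels get a different Latin letter in soft and hard words.
SOFT = {**COMMON, 0x0627: "ә", 0x0648: "ѳ", 0x0649: "i", 0x06C7: "ü"}
HARD = {**COMMON, 0x0627: "a", 0x0648: "o", 0x0649: "i", 0x06C7: "u"}

def transliterate_word(word):
    # Pick the vowel table from the letters present in this word only.
    table = SOFT if any(c in SOFT_MARKERS for c in word) else HARD
    # str.translate deletes characters mapped to None and leaves
    # characters that are not in the table unchanged.
    return word.translate(table)

def transliterate(text):
    # Treat every run of non-whitespace characters as one word.
    return re.sub(r"\S+", lambda m: transliterate_word(m.group(0)), text)

print(transliterate("\u0627\u0644\u0645\u0627 \u06c7\u0644"))  # alma ul
```

Any letter missing from `COMMON` passes through untouched, so gaps in the table show up as leftover Arabic characters in the output rather than as silent errors.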
answered Nov 15 '22 17:11

root