
Which romanization standard should be used to improve ICU4J transliteration for Arabic-Latin?

We have a requirement to transliterate Arabic text to Latin characters (without diacritical marks) and display the result to users.

We are currently using IBM ICU4J for this. The API doesn't transliterate the Arabic text into readable Latin characters very well. See the example below:

Example

  • Arabic text:

    صدام حسين التكريتي

  • Google's transliteration output:

    Sadaam Hussein al-tikriti

  • ICU4J's transliteration output:

    ṣdạm ḥsyn ạltkryty

How can we improve the transliterated output of the ICU4J library?

ICU4J gives us the option to write our own rules, but we are currently stuck: no one on our team knows Arabic, and we have been unable to find a proper standard to follow.

Kamlesh Sharma asked Jun 20 '18 07:06


1 Answer

It took me about four hours of looking for another way to tackle this problem. I then went back to ICU4J and found a solution. You can run the code below and see the step you were missing.

package com.webom.crypt;

import org.apache.commons.lang3.StringEscapeUtils;

import com.ibm.icu.text.Transliterator;

public class Test {

    // Built-in ICU transform: Arabic script to Latin, diacritics kept.
    public static final String ARABIC_TO_LATIN = "Arabic-Latin";
    // Compound ID: transliterate, decompose (NFD), remove combining marks, recompose (NFC).
    public static final String ARABIC_TO_LATIN_NO_ACCENTS =
            "Arabic-Latin; NFD; [:Nonspacing Mark:] Remove; NFC";

    public static void main(String[] args) {
        String arabicString = "صدام حسين التكريتي";

        // Print the input as Java Unicode escapes, for reference.
        String unicodeCodes = StringEscapeUtils.escapeJava(arabicString);
        System.out.println("Unicode codes:" + unicodeCodes);

        // Your way: plain Arabic-Latin, which keeps the diacritical marks.
        Transliterator arabicToLatin = Transliterator.getInstance(ARABIC_TO_LATIN);
        String result1 = arabicToLatin.transliterate(arabicString);
        System.out.println("ARABIC to Latin:" + result1);

        // My way: the same transform followed by removal of the marks.
        Transliterator arabicToLatinNoAccents = Transliterator.getInstance(ARABIC_TO_LATIN_NO_ACCENTS);
        String result2 = arabicToLatinNoAccents.transliterate(arabicString);
        System.out.println("ARABIC to Latin (no accents):" + result2);
    }
}

Run the code and verify it for yourself. The output should be as shown below.

Unicode codes:\u0635\u062F\u0627\u0645 \u062D\u0633\u064A\u0646 \u0627\u0644\u062A\u0643\u0631\u064A\u062A\u064A

ARABIC to Latin:ṣdạm ḥsyn ạltkryty

ARABIC to Latin (no accents):sdam hsyn altkryty
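
The only difference between the two transliterator IDs is the chain appended after Arabic-Latin: decompose to NFD, remove nonspacing marks, recompose to NFC. As a rough illustration of just that mark-stripping step, here is a JDK-only sketch using java.text.Normalizer. The class and method names are made up for this example, and it is not how ICU4J implements the transform internally. It only removes the dots and accents; it does not add the short vowels that Arabic script leaves unwritten, so it will not turn "sdam hsyn" into "Saddam Hussein".

```java
import java.text.Normalizer;

public class StripMarksSketch {

    // Decompose to NFD, delete nonspacing marks (Unicode category Mn),
    // then recompose to NFC.
    static String stripNonspacingMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{Mn}", "");
        return Normalizer.normalize(withoutMarks, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // The plain Arabic-Latin output quoted above, written with Unicode
        // escapes so the source compiles under any default file encoding.
        String icuOutput = "\u1E63d\u1EA1m \u1E25syn \u1EA1ltkryty";
        System.out.println(stripNonspacingMarks(icuOutput)); // sdam hsyn altkryty
    }
}
```

Feeding it the first result line reproduces the second one, which suggests the compound ID solves the "without diacritical marks" part of the requirement but not the readability gap against Google's output.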
this_is_om_vm answered Oct 17 '22 05:10