
How to know whether text is in Arabic or Urdu

I want to know whether a text contains any Urdu or Arabic letters. I am using the condition below, but it produces wrong results when the text contains special characters. What is the right way to do this? Is there a library for it, or what is the right regex?

    if (cap.replaceAll("\\s+", "").matches("[A-Za-z]+")
            || cap.replaceAll("\\s+", "").matches("[A-Za-z0-9]+")) {
        Log.d("isUrdu", "false");
        caption.setTypeface(Typeface.DEFAULT);
        caption.setTextSize(16);
    } else {
        Log.d("isUrdu", "True");
        /* if (Build.VERSION.SDK_INT > Build.VERSION_CODES.JELLY_BEAN_MR1) { */
        caption.setTypeface(typeface);
        caption.setTextSize(20);
        /* } */
    }
Usman Saeed asked Oct 03 '16 10:10



1 Answer

According to the Wikipedia article on the Urdu alphabet, Urdu text uses characters from the following Unicode ranges:

U+0600 to U+06FF
U+0750 to U+077F
U+FB50 to U+FDFF
U+FE70 to U+FEFF

To match a character from the Arabic block (U+0600 to U+06FF), you may use the \p{InArabic} Unicode block class.

So, you may use

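if (cap.matches("(?s).*[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF].*"))
{
    /* There is an Urdu character */
}
else if (cap.matches("(?s).*\\p{InArabic}.*"))
{
    /* The string contains an Arabic character.
       Note: \p{InArabic} is the block U+0600 to U+06FF, which the first
       test already covers, so this branch is not reached as written. */
}
else { /* Neither Arabic nor Urdu characters detected */ }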

Note that (?s) enables the DOTALL modifier so that . could match linebreak symbols, too.
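To illustrate the effect of (?s), here is a minimal sketch. The class name DotAllDemo and its helper are hypothetical and only serve the demonstration; the two patterns are the \p{InArabic} test from above, once without and once with the modifier.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
final class DotAllDemo {
    // The same Arabic-block test, without and with the DOTALL modifier.
    static final String WITHOUT_DOTALL = ".*\\p{InArabic}.*";
    static final String WITH_DOTALL = "(?s).*\\p{InArabic}.*";

    // matches() must consume the whole input, so every line break
    // has to be matched by one of the dots.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Two-line caption; the second line is an Arabic-script word.
        String caption = "first line\n\u0633\u0644\u0627\u0645";
        System.out.println(matchesWhole(WITHOUT_DOTALL, caption)); // false
        System.out.println(matchesWhole(WITH_DOTALL, caption));    // true
    }
}
```

Without the modifier, a multi-line caption is reported as containing no Arabic-script character even when it does, because the dot stops at the line break.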

For better performance with matches(), you may use negated character classes instead of the first .*: "(?s)[^\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]*[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF].*" and "(?s)\\P{InArabic}*\\p{InArabic}.*" respectively.

Note that you may also use the shorter "[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]" and "\\p{InArabic}" patterns with Matcher#find().
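The Matcher#find() variant could be sketched as follows. The class and method names (ArabicScriptDetector, containsArabicScript) are hypothetical; the character class is the one listed above. Since these ranges are shared by both languages, the check only says that some Arabic-script character is present, not which language the text is in.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class ArabicScriptDetector {
    // Compiled once and reused; the ranges are the four listed in the answer.
    private static final Pattern ARABIC_SCRIPT = Pattern.compile(
            "[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]");

    // find() scans for a single matching character anywhere in the text,
    // so no leading or trailing .* and no DOTALL modifier are needed.
    static boolean containsArabicScript(String text) {
        return ARABIC_SCRIPT.matcher(text).find();
    }

    public static void main(String[] args) {
        // Latin letters, digits and punctuation only.
        System.out.println(containsArabicScript("Hello, world! #123")); // false
        // Latin prefix followed by an Arabic-script word.
        System.out.println(containsArabicScript("caption: \u0627\u0631\u062F\u0648")); // true
    }
}
```

In the question's code, the result of containsArabicScript(cap) could replace the [A-Za-z0-9]+ test, so that punctuation and other special characters no longer count as Urdu.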

Wiktor Stribiżew answered Sep 21 '22 19:09