I want to know whether a text contains any Urdu or Arabic letters. I am using the condition below, but it produces wrong results when the text contains special characters. What is the right way to do this? Is there a library, or what is the right regex for it?
if (cap.replaceAll("\\s+", "").matches("[A-Za-z]+")
        || cap.replaceAll("\\s+", "").matches("[A-Za-z0-9]+")) {
    Log.d("isUrdu", "false");
    caption.setTypeface(Typeface.DEFAULT);
    caption.setTextSize(16);
} else {
    Log.d("isUrdu", "True");
    /* if (Build.VERSION.SDK_INT > Build.VERSION_CODES.JELLY_BEAN_MR1) {*/
    caption.setTypeface(typeface);
    caption.setTextSize(20);
    /* }*/
}
One easy way to recognize Arabic text visually is to look for the letter “y” at the end of words (ي) and the feminine “ta” word endings (ـة, ة); these are clear signs that the text is Arabic.
Urdu and Arabic are both written in the Arabic script, so you might think they are closely related, but their writing systems differ. Urdu is written in the Nastaliq style, while Arabic is written in the Naskh style, which was well suited to documenting the Quran because of its clear letterforms.
The Urdu alphabet is a modification of the Persian alphabet, which is itself a derivative of the Arabic alphabet. It has up to 39 or 40 distinct letters with no distinct letter cases and is typically written in the calligraphic Nastaʿlīq script, whereas Arabic is more commonly written in the Naskh style.
Unlike Persian, which is an Iranian language, Urdu is an Indo-Aryan language written in the Perso-Arabic script. Urdu has an Indic vocabulary base derived from Sanskrit and Prakrit, with specialized vocabulary borrowed from Persian.
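Because the Urdu alphabet extends the Arabic and Persian ones, one possible heuristic for telling Urdu apart from Arabic is to look for letters that Urdu uses and standard Arabic does not. The sketch below is only an illustration under that assumption: the class name, the method name, and the choice of letters (ٹ ڈ ڑ ں ھ ہ ے) are mine, not something from the question. Short Urdu texts may contain none of these letters, and other languages written in the same script (Punjabi, for example) use some of them too, so treat a hit as a hint rather than proof.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UrduLetterHeuristic {

    // Letters common in Urdu but absent from standard Arabic (assumed list):
    // U+0679 tteh, U+0688 ddal, U+0691 rreh, U+06BA noon ghunna,
    // U+06BE heh doachashmee, U+06C1 heh goal, U+06D2 yeh barree.
    private static final Pattern URDU_SPECIFIC = Pattern.compile(
            "[\\u0679\\u0688\\u0691\\u06BA\\u06BE\\u06C1\\u06D2]");

    // Returns true if the text contains at least one of the letters above.
    static boolean looksLikeUrdu(String text) {
        return URDU_SPECIFIC.matcher(text).find();
    }
}
```

A caller could run this check first and fall back to a general Arabic-script check when it returns false.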
According to the Wikipedia article on the Urdu alphabet, Urdu text uses the following Unicode ranges:
U+0600 to U+06FF
U+0750 to U+077F
U+FB50 to U+FDFF
U+FE70 to U+FEFF
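These four ranges correspond to the Unicode blocks Arabic, Arabic Supplement, Arabic Presentation Forms-A and Arabic Presentation Forms-B, so the same check can also be written without a regex. The following is a minimal sketch using Character.UnicodeBlock from the Java standard library; the class and method names are made up for illustration.

```java
// Hypothetical helper: checks code points against the four blocks listed above.
final class ArabicScriptBlocks {

    // True if the code point lies in one of the four listed Unicode blocks.
    static boolean isInListedBlocks(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC                       // U+0600..U+06FF
            || block == Character.UnicodeBlock.ARABIC_SUPPLEMENT            // U+0750..U+077F
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A  // U+FB50..U+FDFF
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B; // U+FE70..U+FEFF
    }

    // True if any code point of the text lies in one of those blocks.
    static boolean containsListedBlocks(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isInListedBlocks(cp)) {
                return true;
            }
            i += Character.charCount(cp); // advance by one full code point
        }
        return false;
    }
}
```

Digits, punctuation and other special characters fall outside these blocks, so they do not affect the result.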
To match a character from the Arabic block (U+0600 to U+06FF), you may use the \p{InArabic} Unicode block class.
So, you may use
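if (cap.matches("(?s).*[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF].*")) {
    /* There is an Urdu character */
} else if (cap.matches("(?s).*\\p{InArabic}.*")) {
    /* The string contains an Arabic character.
       Note: \p{InArabic} is the block U+0600 to U+06FF, which the first
       range already covers, so this branch is never reached as written. */
} else {
    /* No Arabic or Urdu characters detected */
}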
Note that (?s) enables the DOTALL modifier so that . can match line break characters, too.
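To see why the modifier matters with String#matches, which must match the whole string, here is a small sketch comparing the same pattern with and without (?s) on a multi-line caption. The class and method names are hypothetical.

```java
// Hypothetical demo of the DOTALL flag with String#matches.
final class DotallDemo {

    // Without (?s), '.' stops at line terminators, so a multi-line
    // string cannot be matched in full.
    static boolean withoutDotall(String s) {
        return s.matches(".*\\p{InArabic}.*");
    }

    // With (?s), '.' also matches line terminators.
    static boolean withDotall(String s) {
        return s.matches("(?s).*\\p{InArabic}.*");
    }
}
```

For a caption such as "line one\n" followed by an Arabic word, withoutDotall returns false while withDotall returns true.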
For better performance with matches, you may use negated character classes instead of the first .*: "(?s)[^\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]*[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF].*" and "(?s)\\P{InArabic}*\\p{InArabic}.*" respectively.
Note that you may also use the shorter "[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]" and "\\p{InArabic}" patterns with Matcher#find().
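A minimal sketch of the Matcher#find() variant might look like this. The class and method names are placeholders; the pattern is compiled once and reused, and since find() searches anywhere in the input, no leading or trailing .* and no (?s) are needed.

```java
import java.util.regex.Pattern;

// Hypothetical helper wrapping the shorter pattern with Matcher#find().
final class ArabicScriptFinder {

    // Compiled once; covers the four Unicode ranges listed above.
    private static final Pattern ARABIC_SCRIPT = Pattern.compile(
            "[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]");

    // True if the text contains at least one character from those ranges.
    static boolean containsArabicScript(String text) {
        return ARABIC_SCRIPT.matcher(text).find();
    }
}
```

In the code from the question, the result of containsArabicScript(cap) could drive the typeface choice directly, and special characters in the caption would no longer flip the result.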