
Split on non-Arabic characters

Tags:

java

regex

I have a String like this

أصبح::ينال::أخذ::حصل (على)::أحضر

I want to split it on non-Arabic characters using Java.

Here's my code:

String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
String[] arr = s.split("^\\p{InArabic}+");
System.out.println(Arrays.toString(arr));

And the output was

[, ::ينال::أخذ::حصل (على)::أحضر]

But I expect the output to be

[ينال,أخذ,حصل,على,أحضر]

What is wrong with my pattern?

Tareq Salah asked Dec 31 '13 10:12


1 Answer

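You need a negated character class, and for that you need square brackets [ ... ]. Outside square brackets, ^ is a start-of-input anchor, so your pattern only matches the leading run of Arabic characters, which is why you got an empty string followed by the rest of the input. Try splitting with this instead: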

"[^\\p{InArabic}]+"

Since \\p{InArabic} matches any character in the Arabic block, [^\\p{InArabic}] matches any character outside it.
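To make the fix concrete, here is a minimal sketch that applies the negated class to the string from the question. The class name ArabicSplitDemo and the helper splitOnNonArabic are made up for illustration and are not part of the original code.

```java
import java.util.Arrays;

class ArabicSplitDemo {
    // Hypothetical helper: splits on every run of characters that fall
    // outside the Arabic Unicode block (the negated class from the answer).
    static String[] splitOnNonArabic(String text) {
        return text.split("[^\\p{InArabic}]+");
    }

    public static void main(String[] args) {
        String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
        // Each run of separators ("::", " (", ")::") is consumed as one delimiter.
        System.out.println(Arrays.toString(splitOnNonArabic(s)));
    }
}
```

For the sample input this should produce six elements, starting with أصبح. That is one more than the expected output listed in the question, which leaves out the first word.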


Another option, as @Pshemo mentioned, is the equivalent syntax with an uppercase P, which negates the \\p{InArabic} character class:

"\\P{InArabic}+"

This works the same way that \\W is the opposite of \\w.
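As a quick check of that equivalence, the sketch below splits the same input with both patterns and compares the results. The class and method names are hypothetical and chosen only for this example.

```java
import java.util.Arrays;

class NegatedPropertyDemo {
    // Negated bracket class: anything that is not in the Arabic block.
    static String[] splitWithNegatedClass(String text) {
        return text.split("[^\\p{InArabic}]+");
    }

    // Uppercase \P: the complement of \p{InArabic}, no brackets needed.
    static String[] splitWithUppercaseP(String text) {
        return text.split("\\P{InArabic}+");
    }

    public static void main(String[] args) {
        String s = "أصبح::ينال::أخذ::حصل (على)::أحضر";
        boolean same = Arrays.equals(splitWithNegatedClass(s), splitWithUppercaseP(s));
        System.out.println("Both patterns give the same result: " + same);
    }
}
```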

The only real advantage of the first syntax over the second (again, as @Pshemo mentioned) is flexibility: it lets you add other characters to the set that should not act as delimiters. For example, to split on all non-Arabic characters except periods:

"[^\\p{InArabic}.]+"
                ^

Otherwise, if you really want to use \\P{InArabic}, you'll need subtraction within the class, which Java expresses as an intersection (&&) with a negated class:

"[\\P{InArabic}&&[^.]]+"
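
The two period-preserving patterns can be compared side by side. This is a minimal sketch: the input string with a period in it is invented for illustration (it is not from the question), and the class and method names are hypothetical.

```java
import java.util.Arrays;

class KeepPeriodsDemo {
    // Negated class with '.' added: periods are no longer delimiters.
    static String[] splitNegatedClass(String text) {
        return text.split("[^\\p{InArabic}.]+");
    }

    // \P{InArabic} with '.' subtracted via intersection with [^.].
    static String[] splitWithSubtraction(String text) {
        return text.split("[\\P{InArabic}&&[^.]]+");
    }

    public static void main(String[] args) {
        // Made-up input: the first two words are joined by a period.
        String s = "أصبح.ينال::أخذ";
        System.out.println(Arrays.toString(splitNegatedClass(s)));
        System.out.println(Arrays.toString(splitWithSubtraction(s)));
    }
}
```

Both calls should return two elements, with the period kept inside the first one.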
Jerry answered Oct 09 '22 02:10