
Replacing Emoji Unicode Range from Arabic Tweets using Java

I am trying to replace emoji in Arabic tweets using Java.

I used this code:

String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز 😂😂";
Pattern unicodeOutliers = Pattern.compile("([\u1F601-\u1F64F])", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(line);
line = unicodeOutlierMatcher.replaceAll(" $1 ");

But it is not replacing them. Even if I match only the character itself, "\u1F602", it is not replaced. Maybe that is because there are 5 digits after the u? I am not sure, it's just a guess.

Note that:

1- the emoticon at the end of the tweet (😂) is U+1F602, which is "face with tears of joy"

2- this question is not a duplicate of this question.

Any ideas?

Daisy asked Dec 25 '22 02:12


1 Answer

From the Javadoc for the Pattern class:

A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.

This means that the regular expression that you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write this as a Java String literal, you must escape the backslashes.

Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

Note that the \x{...} construct is only available from Java 7 onwards.
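For what it's worth, the guess in the question is on the right track: a \u escape reads exactly four hex digits, so "\u1F601" is the character U+1F60 followed by a literal 1, not U+1F601. Here is a minimal sketch of the corrected pattern applied to the tweet from the question. The class name EmojiSpacer and the helper padEmoji are made up for illustration, and the range is simply the one used in the question, not a complete list of emoji.

```java
import java.util.regex.Pattern;

class EmojiSpacer {
    // Range copied from the question: U+1F601 to U+1F64F.
    // \x{...} takes a full code point, so supplementary characters work (Java 7+).
    private static final Pattern EMOJI =
            Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

    // Hypothetical helper: puts a space on each side of every matched emoji.
    static String padEmoji(String text) {
        return EMOJI.matcher(text).replaceAll(" $1 ");
    }

    public static void main(String[] args) {
        String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز 😂😂";
        System.out.println(padEmoji(line));
    }
}
```

The extra flags from the question (UNICODE_CASE, CANON_EQ, CASE_INSENSITIVE) are left out here because the match does not depend on them.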

Dawood ibn Kareem answered Feb 01 '23 10:02