I am trying to replace emoji in Arabic tweets using Java.
I used this code:
String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز 😂😂";
Pattern unicodeOutliers = Pattern.compile("([\u1F601-\u1F64F])", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(line);
line = unicodeOutlierMatcher.replaceAll(" $1 ");
But it does not replace them. Even if I match only the character itself, "\u1F602", it is not replaced. Maybe that is because there are 5 digits after the \u? I am not sure, it is just a guess.
Note that:
1- the emoji at the end of the tweet (😂) is U+1F602, "face with tears of joy"
2- this question is not a duplicate of this question.
Any ideas?
From the Javadoc for the Pattern class:
A Unicode character can also be represented in a regular-expression by using its Hex notation (hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.
This means that the regular expression that you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write this as a Java String literal, you must escape the backslashes.
Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");
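For reference, here is a minimal runnable sketch of that pattern plugged into the replaceAll call from the question. The class name EmojiSpacer and the helper padEmoji are made up for illustration, and the sample string uses an ASCII placeholder plus the surrogate-pair escape for U+1F602 so that it does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class EmojiSpacer {
    // Range U+1F601..U+1F64F written with the \x{...} construct (Java 7+).
    // The backslashes are doubled because this is a Java String literal.
    private static final Pattern EMOJI =
            Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

    // Hypothetical helper: surrounds every matched emoji with spaces,
    // as the question's replaceAll(" $1 ") intends.
    static String padEmoji(String text) {
        return EMOJI.matcher(text).replaceAll(" $1 ");
    }

    public static void main(String[] args) {
        // Two "face with tears of joy" emoji, each given as its surrogate pair.
        String line = "tweet text \uD83D\uDE02\uD83D\uDE02";
        System.out.println(padEmoji(line));
    }
}
```

The flags from the question (UNICODE_CASE, CANON_EQ, CASE_INSENSITIVE) are left out here because this range does not need them; add them back if the rest of your pattern does.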
Note that the construct \x{...} is only available from Java 7.
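As for the guess about the 5 digits: it is on the right track. A Java Unicode escape takes exactly four hex digits, so the compiler reads the first four digits as one character and leaves the fifth as an ordinary literal digit. The small sketch below (class and field names are made up for illustration) shows this, and also that the real emoji is stored as a surrogate pair.

```java
class EscapeLengthDemo {
    // Intended as U+1F602, but the escape stops after four hex digits,
    // so this is the char U+1F60 followed by the literal digit '2'.
    static final String TRUNCATED = "\u1F602";

    // The actual emoji, built from its code point.
    static final String EMOJI = new String(Character.toChars(0x1F602));

    public static void main(String[] args) {
        System.out.println(TRUNCATED.length());                       // 2
        System.out.println(TRUNCATED.charAt(1));                      // 2
        System.out.println(EMOJI.length());                           // 2 (two UTF-16 chars)
        System.out.println(EMOJI.codePointCount(0, EMOJI.length()));  // 1 (one code point)
        System.out.println(EMOJI.equals("\uD83D\uDE02"));             // true
    }
}
```

So the character class in the question never contained the emoji range at all, which is why nothing in that range was matched.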