
What is the regex to extract all the emojis from a string?

I have a String encoded in UTF-8. For example:

Thats a nice joke 😆😆😆 😛

I have to extract all the emojis present in the sentence. And the emoji could be any

When this sentence is viewed in terminal using command less text.txt it is viewed as:

Thats a nice joke <U+1F606><U+1F606><U+1F606> <U+1F61B> 

These are the corresponding Unicode code points for the emoji. All the codes for emojis can be found at emojitracker.

To find all the occurrences, I used the regular expression pattern (<U\+\w+?>), but it didn't work on the UTF-8 encoded string.

Following is my code:

    String s = "Thats a nice joke 😆😆😆 😛";
    Pattern pattern = Pattern.compile("(<U\\+\\w+?>)");
    Matcher matcher = pattern.matcher(s);
    List<String> matchList = new ArrayList<String>();

    while (matcher.find()) {
        matchList.add(matcher.group());
    }

    for (int i = 0; i < matchList.size(); i++) {
        System.out.println(matchList.get(i));
    }

This pdf says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So I want to capture any character lying within this range.

vishalaksh asked Jul 19 '14 12:07




2 Answers

Using emoji-java, I wrote a simple method that removes all emojis, including Fitzpatrick modifiers. It requires an external library but is easier to maintain than those monster regexes.

Use:

    String input = "A string 😄with a \uD83D\uDC66\uD83C\uDFFFfew 😉emojis!";
    String result = EmojiParser.removeAllEmojis(input);

emoji-java maven installation:

    <dependency>
      <groupId>com.vdurmont</groupId>
      <artifactId>emoji-java</artifactId>
      <version>3.1.3</version>
    </dependency>

gradle:

implementation 'com.vdurmont:emoji-java:3.1.3' 

EDIT: previously submitted answer was pulled into emoji-java source code.

gidim answered Sep 26 '22 02:09



The PDF that you just mentioned says Range: 1F300–1F5FF for Miscellaneous Symbols and Pictographs. So let's say I want to capture any character lying within this range. Now what to do?

Okay, but I will just note that the emoji in your question are outside that range! :-)

The fact that these are above 0xFFFF complicates things, because Java strings store UTF-16. So we can't just use one simple character class for it. We're going to have surrogate pairs. (More: http://www.unicode.org/faq/utf_bom.html)

U+1F300 in UTF-16 ends up being the pair \uD83C\uDF00; U+1F5FF ends up being \uD83D\uDDFF. Note that the first character went up, so we cross at least one boundary. That means we have to know which ranges of surrogate pairs we're looking for.
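The arithmetic behind those two pairs can be sketched as follows. This is a minimal illustration, assuming the standard UTF-16 rule for supplementary code points (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). The class name `SurrogateMath` and its helper methods are made up for this example; `Character.highSurrogate` and `Character.lowSurrogate` in the JDK perform the same calculation.

```java
class SurrogateMath {

    // Leading (high) surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Trailing (low) surrogate: 0xDC00 plus the bottom 10 bits of (codePoint - 0x10000).
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    // Formats a code point together with its surrogate pair, e.g. "U+1F300 -> \uD83C\uDF00".
    static String describe(int codePoint) {
        return String.format("U+%X -> \\u%04X\\u%04X",
                codePoint, (int) high(codePoint), (int) low(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x1F300)); // U+1F300 -> \uD83C\uDF00
        System.out.println(describe(0x1F5FF)); // U+1F5FF -> \uD83D\uDDFF
    }
}
```

The high surrogate changes from \uD83C to \uD83D exactly once inside the range, at U+1F400, which is the boundary mentioned above.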

Not being steeped in knowledge about the inner workings of UTF-16, I wrote a program to find out (source at the end; I'd double-check it if I were you, rather than trusting me). It tells me we're looking for \uD83C followed by anything in the range \uDF00-\uDFFF (inclusive), or \uD83D followed by anything in the range \uDC00-\uDDFF (inclusive).

So armed with that knowledge, in theory we could now write a pattern:

    // This is wrong, keep reading
    Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");

That's an alternation of two non-capturing groups, the first group for the pairs starting with \uD83C, and the second group for the pairs starting with \uD83D.

But that fails (doesn't find anything). I'm fairly sure it's because we're trying to specify half of a surrogate pair in various places:

    Pattern p = Pattern.compile("(?:\uD83C[\uDF00-\uDFFF])|(?:\uD83D[\uDC00-\uDDFF])");
    // Half of a pair --------------^------^------^-----------^------^------^

We can't just split up surrogate pairs like that, they're called surrogate pairs for a reason. :-)

Consequently, I don't think a pattern built from individual surrogate halves can work for this. One approach that does work is to search through char arrays.

char arrays hold UTF-16 values, so we can find those half-pairs in the data if we look for them the hard way:

    String s = new StringBuilder()
                    .append("Thats a nice joke ")
                    .appendCodePoint(0x1F606)
                    .appendCodePoint(0x1F606)
                    .appendCodePoint(0x1F606)
                    .append(" ")
                    .appendCodePoint(0x1F61B)
                    .toString();
    char[] chars = s.toCharArray();
    int index;
    char ch1;
    char ch2;

    index = 0;
    while (index < chars.length - 1) { // -1 because we're looking for two-char-long things
        ch1 = chars[index];
        if ((int)ch1 == 0xD83C) {
            ch2 = chars[index+1];
            if ((int)ch2 >= 0xDF00 && (int)ch2 <= 0xDFFF) {
                System.out.println("Found emoji at index " + index);
                index += 2;
                continue;
            }
        }
        else if ((int)ch1 == 0xD83D) {
            ch2 = chars[index+1];
            if ((int)ch2 >= 0xDC00 && (int)ch2 <= 0xDDFF) {
                System.out.println("Found emoji at index " + index);
                index += 2;
                continue;
            }
        }
        ++index;
    }

Obviously that's just debug-level code, but it does the job. (In your given string, with its emoji, of course it won't find anything as they're outside the range. But if you change the upper bound on the second pair to 0xDEFF instead of 0xDDFF, it will. No idea if that would also include non-emojis, though.)
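For comparison with the char-array scan, it may be worth trying a character class written in terms of code points instead of surrogate halves. As far as I know, `java.util.regex` (Java 7 and later) accepts `\x{...}` escapes for supplementary code points and compares whole code points when matching. The following is only a sketch under that assumption: the class name `EmojiRangeRegex` and the helpers `forRange`, `extract`, and `sample` are hypothetical, and the widened upper bound U+1F64F (the end of the Emoticons block) is my own choice for covering the emoji in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiRangeRegex {

    // Builds a one-code-point character class such as [\x{1F300}-\x{1F5FF}].
    // forRange is a hypothetical helper name, not an existing API.
    static Pattern forRange(int first, int last) {
        return Pattern.compile(String.format("[\\x{%X}-\\x{%X}]", first, last));
    }

    // Returns every match as its own string (two chars each for these code points).
    static List<String> extract(Pattern range, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = range.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    // The sample sentence from the question, built without literal emoji in the source.
    static String sample() {
        return new StringBuilder()
                .append("Thats a nice joke ")
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .appendCodePoint(0x1F606)
                .append(" ")
                .appendCodePoint(0x1F61B)
                .toString();
    }

    public static void main(String[] args) {
        // Miscellaneous Symbols and Pictographs only: the sample's emoji fall outside it.
        System.out.println(extract(forRange(0x1F300, 0x1F5FF), sample()).size());
        // Widened to the end of the Emoticons block (U+1F64F).
        for (String emoji : extract(forRange(0x1F300, 0x1F64F), sample())) {
            System.out.println(String.format("U+%X", emoji.codePointAt(0)));
        }
    }
}
```

With the original range the sample yields no matches, in line with the note above that the question's emoji sit outside it. With the widened range it should report U+1F606 three times and U+1F61B once. Whether the wider range also picks up characters you would not call emoji is a separate question that this sketch does not settle.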


Source of my program to find out what the surrogate ranges were:

    public class FindRanges {

        public static void main(String[] args) {
            char last0 = '\0';
            char last1 = '\0';
            for (int x = 0x1F300; x <= 0x1F5FF; ++x) {
                char[] chars = new StringBuilder().appendCodePoint(x).toString().toCharArray();
                if (chars[0] != last0) {
                    if (last0 != '\0') {
                        System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
                    }
                    System.out.print("\\u" + Integer.toHexString((int)chars[0]).toUpperCase() + " \\u" + Integer.toHexString((int)chars[1]).toUpperCase());
                    last0 = chars[0];
                }
                last1 = chars[1];
            }
            if (last0 != '\0') {
                System.out.println("-\\u" + Integer.toHexString((int)last1).toUpperCase());
            }
        }
    }

Output:

    \uD83C \uDF00-\uDFFF
    \uD83D \uDC00-\uDDFF
T.J. Crowder answered Sep 22 '22 02:09
