
Splitting a String that contains emojis

I need to split a String that may or may not contain emojis into a list of individual characters (keeping the emojis intact). Currently, and as is to be expected, any emoji is split into its individual parts.

String s = "🙁🙂abc🙁";
String[] tokens = s.split("");
// tokens is ["?","?","?","?","a","b","c","?","?"]
// tokens should be ["🙁","🙂","a","b","c","🙁"]

I want to keep project size to a minimum and with few to no dependencies, so I want to stay away from any 3rd party libraries. The exact output type doesn't matter too much, so long as I can at least iterate through the tokens in order.

Michael Bianconi asked Oct 25 '25 16:10



1 Answer

You can match and extract each base character together with any combining marks (diacritics) that follow it:

\P{M}\p{M}*+

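It matches any code point that is not a combining mark (\P{M}), followed by zero or more combining marks (\p{M}*+, with a possessive quantifier). Java's regex engine matches by code point rather than by UTF-16 code unit, so an emoji stored as a surrogate pair is consumed as a single unit.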
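To make the behaviour of the pattern concrete, here is a minimal sketch that runs it on a string containing a combining accent and a surrogate-pair emoji. The class name `MarkPatternSketch` and the helper `tokens` are made up for illustration; the string uses Unicode escapes only so that the example does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkPatternSketch {
    // One non-mark code point followed by any number of combining marks
    private static final Pattern BASE_PLUS_MARKS = Pattern.compile("\\P{M}\\p{M}*+");

    // Hypothetical helper: returns every match of the pattern, in order
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = BASE_PLUS_MARKS.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "e" + U+0301 (combining acute accent), then U+1F642 as a surrogate pair
        String s = "e\u0301\uD83D\uDE42";
        List<String> t = tokens(s);
        System.out.println(t.size());          // 2
        System.out.println(t.get(0).length()); // 2: base letter plus its mark
        System.out.println(t.get(1).length()); // 2: both halves of the surrogate pair
    }
}
```

One caveat, as far as I can tell: the zero-width joiner (U+200D) and the emoji skin-tone modifiers are not in category M, so emoji built from several code points would still be split into their parts by this pattern.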

Java 9+ demo:

import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

class Ideone
{
    public static void main (String[] args)
    {
        String s = "🙁🙂abc🙁";
        // Matcher.results() (Java 9+) streams every match of the pattern
        List<String> results = Pattern.compile("\\P{M}\\p{M}*+").matcher(s)
            .results()
            .map(MatchResult::group)   // keep the matched text of each hit
            .collect(Collectors.toList());
        System.out.println(results);
    }
}
// => [🙁, 🙂, a, b, c, 🙁]

In earlier Java versions, you can loop over the matches with Matcher.find():

import java.util.*;
import java.util.regex.*;
//.....
String s = "🙁🙂abc🙁";
List<String> results = new ArrayList<>();
Matcher m = Pattern.compile("\\P{M}\\p{M}*+").matcher(s);
// Collect each match (base character plus trailing marks) in order
while (m.find()) {
    results.add(m.group());
}
System.out.println(results);  // => [🙁, 🙂, a, b, c, 🙁]
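If keeping diacritics attached to their base character is not needed, a regex-free alternative (a different technique from the pattern above, not part of it) is to iterate over code points with `String.codePoints()`, available since Java 8. A rough sketch, where `CodePointSplitSketch` and `splitByCodePoint` are hypothetical names:

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointSplitSketch {
    // Hypothetical helper: one String per Unicode code point, in order
    static List<String> splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // U+1F641, U+1F642, "abc", U+1F641, written as surrogate pairs
        String s = "\uD83D\uDE41\uD83D\uDE42abc\uD83D\uDE41";
        System.out.println(s.length());                      // 9 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 6 code points
        System.out.println(splitByCodePoint(s).size());      // 6 tokens
    }
}
```

The gap between 9 code units and 6 code points is why `split("")` produced nine tokens in the question. Unlike the regex, this version puts a combining mark in its own token.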

See another Java demo

Wiktor Stribiżew answered Oct 27 '25 04:10
