I need to split a String that may or may not contain emojis into a list of individual characters (keeping the emojis intact). Currently, and as is to be expected, any emoji is split into its individual parts.
String s = "ππabcπ";
String[] tokens = s.split("");
// tokens is ["?","?","?","?","a","b","c","?","?"]
// tokens should be ["π","π","a","b","c","π"]
I want to keep project size to a minimum and with few to no dependencies, so I want to stay away from any 3rd party libraries. The exact output type doesn't matter too much, so long as I can at least iterate through the tokens in order.
You can match and extract each base character (a full Unicode code point, so an emoji stored as a surrogate pair counts as one) together with any combining marks (diacritics) that follow it:
\P{M}\p{M}*+
\P{M} matches any single code point that is not a combining mark, and \p{M}*+ then matches zero or more combining marks possessively (without backtracking).
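To make the grouping concrete, here is a minimal sketch that applies the same pattern to a string containing a letter with a combining accent, an emoji, and a plain letter. The class name `MarkAwareSplit` and the helper `split` are hypothetical names chosen for illustration, and the input is written with Unicode escapes so the example does not depend on file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MarkAwareSplit {
    // One non-mark code point followed by any number of combining marks.
    private static final Pattern TOKEN = Pattern.compile("\\P{M}\\p{M}*+");

    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input: "e" + combining acute accent (U+0301), emoji U+1F600, then "a".
        String s = "e\u0301\uD83D\uDE00a";
        List<String> tokens = split(s);
        System.out.println(tokens.size());          // 3
        System.out.println(tokens.get(0).length()); // 2: base letter plus combining mark
        System.out.println(tokens.get(1).length()); // 2: one code point, two UTF-16 chars
    }
}
```

Note that a zero-width joiner (U+200D) is not a combining mark, so an emoji built from several code points joined that way would still come out as separate tokens with this pattern.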
Java 9+ demo:
import java.util.*;
import java.util.stream.*;
import java.util.regex.*;
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String s = "ππabcπ";
        List<String> results = Pattern.compile("\\P{M}\\p{M}*+").matcher(s)
            .results()
            .map(MatchResult::group)
            .collect(Collectors.toList());
        System.out.println(results);
    }
}
// => [π, π, a, b, c, π]
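In earlier Java versions, where Matcher.results() is not available, you can collect the matches in a loop:
import java.util.*;
import java.util.regex.*;
//.....
String s = "ππabcπ";
List<String> results = new ArrayList<>();
Matcher m = Pattern.compile("\\P{M}\\p{M}*+").matcher(s);
// Each successful find() yields one base character plus its combining marks
while (m.find()) {
    results.add(m.group());
}
System.out.println(results); // => [π, π, a, b, c, π]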
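If grouping combining marks with their base character is not needed, a regex-free alternative is to iterate over code points. This is a different technique from the pattern above, sketched here under that assumption; the class name `CodePointSplit` and the helper `split` are hypothetical. It relies only on `String.codePoints()` and `Character.toChars`, both available since Java 8.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original answer.
class CodePointSplit {
    // Turns each Unicode code point into its own String, so a surrogate
    // pair (such as an emoji outside the BMP) stays in a single token.
    static List<String> split(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two emoji (U+1F600), "abc", then one more emoji.
        String s = "\uD83D\uDE00\uD83D\uDE00abc\uD83D\uDE00";
        List<String> tokens = split(s);
        System.out.println(tokens.size()); // 6
        System.out.println(tokens.get(2)); // a
    }
}
```

Unlike the regex, this splits a combining mark off from its base character, so it fits best when the input contains only precomposed letters and single-code-point emoji.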