How to count grapheme clusters or "perceived" emoji characters in Java

I'm looking to count the number of perceived emoji characters in a provided Java string. I'm currently using the emoji4j library, but it doesn't work for grapheme clusters like this one: 👩‍👩‍👦‍👦

Calling EmojiUtil.getLength("👩‍👩‍👦‍👦") returns 4 instead of 1, and similarly calling EmojiUtil.getLength("👻👩‍👩‍👦‍👦") returns 5 instead of 2.

Are there any APIs or methods on String in Java that make it easy to count grapheme clusters?

I've been hunting around, but the codePoints() method on a String understandably includes not only the visible emojis but also the zero-width joiners (U+200D) that glue them together.

I also attempted this using the BreakIterator:

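import java.text.BreakIterator;

// Counts the boundaries reported by the JDK's character BreakIterator.
public static int getLength(String emoji) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(emoji);
    int emojiCount = 0;
    // next() returns the offset of the next boundary, or DONE at the end of the text.
    while (it.next() != BreakIterator.DONE) {
        emojiCount++;
    }
    return emojiCount;
}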

But it seems to behave identically to the codePoints() method, returning 8 for something like "👻👩‍👩‍👦‍👦".
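To see where the 8 comes from, it can help to list the code points directly. The following is a small diagnostic sketch using only the standard library; the class and method names are illustrative and not part of emoji4j or any other library. The emoji are written as escapes so the source survives any file encoding.

```java
import java.util.stream.Collectors;

final class CodePointBreakdown {
    static final int ZWJ = 0x200D; // ZERO WIDTH JOINER

    // Number of Unicode code points in the string (what codePoints() iterates over).
    static long codePointCount(String s) {
        return s.codePoints().count();
    }

    // Number of zero-width joiners among those code points.
    static long joinerCount(String s) {
        return s.codePoints().filter(cp -> cp == ZWJ).count();
    }

    // Lists each code point in U+XXXX notation for inspection.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // WOMAN, ZWJ, WOMAN, ZWJ, BOY, ZWJ, BOY (the family emoji from the question)
        String family = "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
        // GHOST followed by the family sequence
        String sample = "\uD83D\uDC7B" + family;

        System.out.println(describe(sample));
        // prints: U+1F47B U+1F469 U+200D U+1F469 U+200D U+1F466 U+200D U+1F466
        System.out.println(codePointCount(sample)); // prints 8
        System.out.println(joinerCount(sample));    // prints 3
    }
}
```

So the 8 is one ghost, four people, and three joiners: the iterator is splitting at every code point instead of keeping the joined sequence together.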

Craig Otis asked Nov 30 '16 01:11

1 Answer

I ended up using the ICU library (ICU4J), which handled these sequences much better. No changes to my original code block were needed aside from the import statement, since ICU simply provides a different implementation of BreakIterator (com.ibm.icu.text.BreakIterator in place of java.text.BreakIterator).
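A minimal sketch of that swap, assuming ICU4J (Maven artifact com.ibm.icu:icu4j) is on the classpath. The class name GraphemeCounter is only illustrative. The method body is the same as in the question, and only the import differs.

```java
import com.ibm.icu.text.BreakIterator; // ICU4J class, used instead of java.text.BreakIterator

final class GraphemeCounter {
    // Counts grapheme cluster boundaries as reported by ICU's character iterator.
    static int getLength(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // WOMAN, ZWJ, WOMAN, ZWJ, BOY, ZWJ, BOY
        String family = "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
        // GHOST followed by the family sequence
        String sample = "\uD83D\uDC7B" + family;

        // With an ICU4J release that implements the emoji ZWJ rules,
        // these are expected to be 1 and 2 respectively.
        System.out.println(getLength(family));
        System.out.println(getLength(sample));
    }
}
```

The result depends on the Unicode version the ICU4J release implements, so it is worth pinning the dependency version and covering the sequences you care about with a test.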

Craig Otis answered Oct 05 '22 14:10