 

How to get the first symbol of a string

I need to display the first symbol of a string. The simplest code for this would be:

String text = "test string";
char firstSymbol = text[0];

But this doesn't work if the character doesn't fit in 16 bits, for example "\uD83D\uDC68" (👨, U+1F468). Only the first half of the surrogate pair is returned, and it is rendered as a question mark.
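A minimal sketch of how to detect this situation, assuming a hypothetical helper class `SurrogateCheck` (the name is made up for illustration). It only checks whether the first two `char` values form a surrogate pair, in which case `charAt(0)` alone is just half of a character:

```java
final class SurrogateCheck {
    /** Returns true if charAt(0) is only the first half of a surrogate pair. */
    static boolean firstCharIsHalfOfPair(String text) {
        // A supplementary character (above U+FFFF) occupies two chars in a String.
        return text.length() >= 2
                && Character.isSurrogatePair(text.charAt(0), text.charAt(1));
    }
}
```

For "\uD83D\uDC68" this returns true, and for "test string" it returns false.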

String text = "test string";
int codePoint = text.codePointAt(0);
char[] chars = Character.toChars(codePoint);
String firstSymbol = new String(chars);

This works well for any single Unicode code point. However, there are sequences of Unicode characters that are displayed as one symbol. When I run the code above on them, only part of the symbol is displayed, as happens for "\uD83D\uDC68\u200D\uD83D\uDCBB" (👨‍💻). In this case I want the result to be the whole string. How can I handle such cases?
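To make the problem concrete, here is a small sketch that lists the code points of a string. The class and method names (`CodePointDump`, `codePointsOf`) are hypothetical. It shows that the 👨‍💻 string is three code points, of which `codePointAt(0)` sees only the first:

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointDump {
    /** Lists each code point of the text in U+XXXX notation. */
    static List<String> codePointsOf(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            result.add(String.format("U+%04X", cp));
            // Advance by 1 or 2 chars depending on the code point.
            i += Character.charCount(cp);
        }
        return result;
    }
}
```

For "\uD83D\uDC68\u200D\uD83D\uDCBB" this yields U+1F468, U+200D, U+1F4BB: a man, a zero-width joiner, and a laptop.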


It should be charAt(), of course; my mistake. But a char is a single UTF-16 code unit and can't hold several characters. The first example should be this:

String text = "test string";
char firstSymbol = text.charAt(0);

Another tough example of a single symbol is "\u0D23\u0D4D\u200D" (ണ്‍). It consists of two characters followed by a zero-width joiner.


I have tried to use the android.icu library, which descends from ICU4J, but unfortunately it is supported only from API 24 onward. Moreover, it produces the same result as the second example, i.e. it doesn't join characters if there is a zero-width joiner between them.

BreakIterator breakIterator = BreakIterator.getCharacterInstance();
breakIterator.setText(text);
int begin = breakIterator.first();
int end = breakIterator.next();
String firstSymbol = text.substring(begin, end);
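For comparison, the same approach can be sketched with `java.text.BreakIterator` from the standard library instead of android.icu. This is only a sketch: the class name `FirstGrapheme` is hypothetical, and whether a zero-width-joiner sequence is kept together depends on the Unicode rules built into the runtime, so it may show the same limitation described above.

```java
import java.text.BreakIterator;

final class FirstGrapheme {
    /** Returns the text up to the first character boundary, or "" if there is none. */
    static String firstGrapheme(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int begin = it.first();
        int end = it.next();
        // next() returns DONE when the text has no further boundary (e.g. empty input).
        if (end == BreakIterator.DONE) {
            return "";
        }
        return text.substring(begin, end);
    }
}
```

The check against `BreakIterator.DONE` avoids the exception that `substring` would otherwise throw for an empty string.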
fdermishin asked Mar 08 '23 07:03


1 Answer

\u200D is the Unicode code point U+200D ZERO WIDTH JOINER. If you want to extract a sequence of joined code points, you are going to have to iterate over the string manually until you encounter a code point that is not joined to the previous one, e.g.:

String text = ...;
StringBuilder sequence = new StringBuilder(text.length());
boolean isInJoin = false;
int codePoint;

for (int i = 0; i < text.length(); i = text.offsetByCodePoints(i, 1))
{
    codePoint = text.codePointAt(i);

    if (codePoint == 0x200D)
    {
        isInJoin = true;
        if (sequence.length() == 0)
            continue;
    }
    else
    {
        if ((sequence.length() > 0) && (!isInJoin)) break;
        isInJoin = false;
    }

    sequence.appendCodePoint(codePoint);
}

if (isInJoin)
{
    for(int i = sequence.length()-1; i >= 0; --i)
    {
        if (sequence.charAt(i) == 0x200D)
            sequence.deleteCharAt(i);
        else
            break;
    }
}

String firstSymbols = sequence.toString();

Alternatively:

String text = ...;
boolean isInJoin = false;
int start = 0, length = 0, next;
int codePoint;

for (int i = 0; i < text.length(); i = next)
{
    codePoint = text.codePointAt(i);

    if (codePoint == 0x200D)
    {
        isInJoin = true;
        if (length == 0)
        {
            next = text.offsetByCodePoints(i, 1);
            start = next;
            continue;
        }
    }
    else
    {
        if ((length > 0) && (!isInJoin)) break;
        isInJoin = false;
    }

    next = text.offsetByCodePoints(i, 1);
    length += (next - i);
}

if (isInJoin)
{
    for(int i = length-1; i >= 0; --i)
    {
        if (text.charAt(start + i) == 0x200D)
            --length;
        else
            break;
    }
}

String firstSymbols = text.substring(start, start+length);
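The same idea can be packaged as a reusable method. The following is a sketch under the same assumptions as the code above, with hypothetical names (`ZwjSequences`, `firstJoinedSequence`): leading joiners are skipped, trailing joiners are dropped, and only U+200D is treated as a joiner, so combining marks such as U+0D4D are not grouped with their base character.

```java
final class ZwjSequences {
    private static final int ZWJ = 0x200D;

    /** Returns the first code point plus any code points joined to it by U+200D. */
    static String firstJoinedSequence(String text) {
        int i = 0;
        // Skip leading joiners (U+200D is in the BMP, so it is exactly one char).
        while (i < text.length() && text.charAt(i) == ZWJ) {
            i++;
        }
        int start = i;
        int end = i;            // end of the last accepted non-joiner code point
        boolean joined = true;  // the first code point is always accepted
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int next = i + Character.charCount(cp);
            if (cp == ZWJ) {
                joined = true;
            } else {
                if (!joined) {
                    break;      // not preceded by a joiner: the sequence is over
                }
                joined = false;
                end = next;
            }
            i = next;
        }
        // Trailing joiners are excluded because end only moves past non-joiners.
        return text.substring(start, end);
    }
}
```

Calling `ZwjSequences.firstJoinedSequence("\uD83D\uDC68\u200D\uD83D\uDCBBabc")` returns the whole 👨‍💻 sequence, while `"test string"` gives `"t"`.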
Remy Lebeau answered Mar 15 '23 16:03