I need to display the first symbol of a string. The simplest code for this would be:
String text = "test string";
char firstSymbol = text[0];
But this doesn't work if the character doesn't fit in 16 bits, for example "\uD83D\uDC68" (👨, U+1F468). Only half of the character is returned, and it is rendered as a question mark. Reading the whole code point instead avoids that:
String text = "test string";
int codePoint = text.codePointAt(0);
char[] chars = Character.toChars(codePoint);
String firstSymbol = new String(chars);
This works well for any character that is a single Unicode code point. However, some sequences of Unicode characters are displayed as one symbol. When I run the code above on such a sequence, only part of the symbol is displayed, as happens for "\uD83D\uDC68\u200D\uD83D\uDCBB"
(👨💻). In this case I want the result to be the whole string. How can I handle such cases?
It should be charAt() of course, my fault. But a char is a single UTF-16 code unit and can't contain several characters. The first example should be this:
String text = "test string";
char firstSymbol = text.charAt(0);
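To make the difference between a UTF-16 code unit and a code point concrete, here is a minimal sketch. The class name CodeUnitDemo and the helper firstCodePoint are made up for illustration; the calls themselves are standard java.lang methods.

```java
final class CodeUnitDemo {
    // Hypothetical helper: returns the first code point of s as a String.
    // The result is one char long for BMP characters and two chars long
    // (a surrogate pair) for supplementary characters such as U+1F468.
    static String firstCodePoint(String s) {
        return new String(Character.toChars(s.codePointAt(0)));
    }

    public static void main(String[] args) {
        String man = "\uD83D\uDC68"; // U+1F468 stored as a surrogate pair

        System.out.println(man.length());                             // 2
        System.out.println(man.codePointCount(0, man.length()));      // 1
        // charAt(0) yields only the high surrogate, which cannot be
        // rendered on its own.
        System.out.println(Character.isHighSurrogate(man.charAt(0))); // true
        System.out.println(firstCodePoint(man).equals(man));          // true
    }
}
```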
Another tough example of a single symbol is "\u0D23\u0D4D\u200D" (ണ്). It consists of two characters with a zero-width joiner at the end.
I have tried to use the android.icu library, which descends from ICU4J, but unfortunately it is only supported starting from API 24. Moreover, it produces the same result as the second example, i.e. it doesn't join characters if a zero-width joiner is between them.
BreakIterator breakIterator = BreakIterator.getCharacterInstance();
breakIterator.setText(text);
int begin = breakIterator.first();
int end = breakIterator.next();
String firstSymbol = text.substring(begin, end);
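For comparison, the same pattern can be written against java.text.BreakIterator, which does not depend on API 24. This is only a sketch under assumptions: the class and method names are hypothetical, and which sequences are kept together depends on the Unicode rules shipped with the runtime, so zero-width-joiner sequences may still be split just as described above.

```java
import java.text.BreakIterator;

final class FirstGrapheme {
    // Hypothetical helper: returns the text up to the first character
    // boundary reported by the runtime's BreakIterator.
    static String firstGrapheme(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int begin = it.first();
        int end = it.next();
        if (end == BreakIterator.DONE) {
            return ""; // empty input has no boundary after the first one
        }
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        // A base letter followed by a combining acute accent stays together.
        System.out.println(firstGrapheme("e\u0301x").length()); // 2
        System.out.println(firstGrapheme("test string"));       // t
    }
}
```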
\u200D is Unicode codepoint U+200D ZERO WIDTH JOINER. If you want to extract a sequence of joined codepoints, you are going to have to iterate the string manually until you encounter a non-joined codepoint, e.g.:
String text = ...;
StringBuilder sequence = new StringBuilder(text.length());
boolean isInJoin = false;
int codePoint;
for (int i = 0; i < text.length(); i = text.offsetByCodePoints(i, 1))
{
    codePoint = text.codePointAt(i);
    if (codePoint == 0x200D)
    {
        isInJoin = true;
        if (sequence.length() == 0)
            continue;
    }
    else
    {
        if ((sequence.length() > 0) && (!isInJoin)) break;
        isInJoin = false;
    }
    sequence.appendCodePoint(codePoint);
}
if (isInJoin)
{
    for (int i = sequence.length()-1; i >= 0; --i)
    {
        if (sequence.charAt(i) == 0x200D)
            sequence.deleteCharAt(i);
        else
            break;
    }
}
String firstSymbols = sequence.toString();
Alternatively:
String text = ...;
boolean isInJoin = false;
int start = 0, length = 0, next;
int codePoint;
for (int i = 0; i < text.length(); i = next)
{
    codePoint = text.codePointAt(i);
    if (codePoint == 0x200D)
    {
        isInJoin = true;
        if (length == 0)
        {
            // skip joiners that appear before any real codepoint
            next = text.offsetByCodePoints(i, 1);
            start = next;
            continue;
        }
    }
    else
    {
        // a codepoint not preceded by a joiner starts the next symbol
        if ((length > 0) && (!isInJoin)) break;
        isInJoin = false;
    }
    next = text.offsetByCodePoints(i, 1);
    length += (next - i);
}
if (isInJoin)
{
    // drop trailing joiners; index relative to start, not to 0
    for (int i = start+length-1; i >= start; --i)
    {
        if (text.charAt(i) == 0x200D)
            --length;
        else
            break;
    }
}
String firstSymbols = text.substring(start, start+length);
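The same idea can be wrapped in a method and run against the strings from the question. This is a sketch, not a full grapheme-cluster implementation: the names JoinedSequence and firstJoinedSequence are hypothetical, and it only joins on U+200D, so combining marks such as U+0D4D are still treated as separate symbols.

```java
final class JoinedSequence {
    static final int ZWJ = 0x200D; // ZERO WIDTH JOINER

    // Hypothetical helper: returns the first code point plus every code
    // point attached to it through zero-width joiners. Leading and
    // trailing joiners are not included in the result.
    static String firstJoinedSequence(String text) {
        int i = 0;
        // Skip joiners that appear before any real code point.
        while (i < text.length() && text.codePointAt(i) == ZWJ) {
            i += Character.charCount(ZWJ);
        }
        int start = i;
        int end = i;           // end of the last accepted non-joiner
        boolean joined = true; // the first code point is always accepted
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int next = i + Character.charCount(cp);
            if (cp == ZWJ) {
                joined = true;
            } else {
                if (!joined) break; // not attached: next symbol starts here
                joined = false;
                end = next;
            }
            i = next;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String technologist = "\uD83D\uDC68\u200D\uD83D\uDCBB";
        // The whole five-char sequence is returned as one symbol.
        System.out.println(firstJoinedSequence(technologist).length()); // 5
        System.out.println(firstJoinedSequence("test string"));         // t
    }
}
```

Because end only advances past non-joiner code points, trailing joiners are trimmed without a second pass.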