How do I convert strings representing code points to the appropriate character?
For example, I want a function that takes "U+00E4" and returns "ä".
I know that the Character class has a method toChars(int codePoint) that takes an integer, but there is no method that takes a string of this form.
Is there a built-in function, or do I have to transform the string myself to get the integer that I can pass to that method?
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same codepoints in ASCII as in UTF-8, which is probably what you have.
We can determine the Unicode general category of a particular character with the getType() method. It is a static method of the Character class and returns an int value identifying the character's Unicode general category.
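As a small illustration of getType(), the sketch below compares its result with the category constants defined on Character. The class name CategoryDemo and the helper isCasedLetter are made up for this example and are not part of the JDK.

```java
public class CategoryDemo {

    // Hypothetical helper: true when the code point's general category is
    // UPPERCASE_LETTER or LOWERCASE_LETTER.
    public static boolean isCasedLetter(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.UPPERCASE_LETTER || type == Character.LOWERCASE_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(Character.getType('A') == Character.UPPERCASE_LETTER);     // true
        System.out.println(Character.getType(0x00E4) == Character.LOWERCASE_LETTER);  // true
        System.out.println(Character.getType('7') == Character.DECIMAL_DIGIT_NUMBER); // true
    }
}
```

The int overload getType(int codePoint) is used for 0x00E4, so the same call also works for code points outside the Basic Multilingual Plane.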
Code points are written as hexadecimal numbers prefixed by U+, so you can do this:
int codepoint = Integer.parseInt(yourString.substring(2), 16);
char[] ch = Character.toChars(codepoint);
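The two calls above can be wrapped into a reusable method with some input checking. This is only a sketch: the class name CodePointParser, the method name fromUPlus and the choice to throw IllegalArgumentException on bad input are assumptions made for illustration.

```java
public class CodePointParser {

    // Hypothetical helper: converts text such as "U+00E4" into the character(s) it denotes.
    public static String fromUPlus(String notation) {
        // Accept "U+" or "u+" as the prefix; anything else is rejected.
        if (notation == null || !notation.regionMatches(true, 0, "U+", 0, 2)) {
            throw new IllegalArgumentException("Expected U+ followed by hex digits: " + notation);
        }
        // parseInt throws NumberFormatException (a subclass of
        // IllegalArgumentException) when the remainder is not valid hex.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Outside the Unicode range: " + notation);
        }
        // toChars yields one char for BMP code points and a surrogate pair otherwise.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromUPlus("U+00E4"));
        System.out.println(fromUPlus("U+1F601"));
    }
}
```

Returning a String rather than a char[] keeps the caller from having to deal with surrogate pairs for code points above U+FFFF.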
Use the escape "\u00E4" in a string literal, or call this constructor on String:
new String(new int[] { 0x00E4 }, 0, 1);
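To show what the String(int[] codePoints, int offset, int count) constructor does with code points inside and outside the Basic Multilingual Plane, here is a minimal sketch. The class name CodePointStrings and the helper fromCodePoints are hypothetical.

```java
public class CodePointStrings {

    // Hypothetical helper: builds a String from code points
    // using the String(int[], int, int) constructor.
    public static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String aUmlaut = fromCodePoints(0x00E4);
        System.out.println(aUmlaut.equals("\u00E4"));              // true

        // U+1F601 lies outside the BMP: two chars, but a single code point.
        String grin = fromCodePoints(0x1F601);
        System.out.println(grin.length());                         // 2
        System.out.println(grin.codePointCount(0, grin.length())); // 1
    }
}
```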
The question asked for a function to convert a string value representing a Unicode code point (i.e. "U+nnnn" rather than the Java formats "\unnnn" or "0xnnnn"). However, newer releases of Java have enhancements which simplify the processing of a string containing multiple code points in Unicode format, among them the method
public static String toString(int codePoint)
which was added to the Character class in Java 11. It returns a String rather than a char[], so Character.toString(0x00E4) returns "ä".
Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a set of code points, written here with a "+U" prefix rather than the standard "U+" order, into a readable String in a single statement:
import java.util.Arrays;
import java.util.stream.Collectors;

void processUnicode() {
    // Create a test string containing "Hello World 😁" with code points in Unicode format.
    // Include an invalid code point (+U0wxyz), and a code point outside the Unicode range (+U70FFFF).
    String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";
    String text = Arrays.stream(data.split("\\+U"))
            .filter(s -> !s.isEmpty()) // First element returned by split() is a zero length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) {
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\")");
                }
                return null; // If the code point is not represented as a valid hex String.
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11)
            .collect(Collectors.joining());
    System.out.println(text); // Prints "Hello World 😁"
}
And this is the output:
run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz")
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)
Notes:
The conversion of each code point happens as part of the Stream processing. Of course the same code could still be used to process just a single code point in Unicode format.
Further operations can be applied to the Stream, such as case conversion, removal of emoticons, etc.
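As an illustration of adding one more operation to such a pipeline, the sketch below inserts a case-conversion step. It is a variant, not the answer's code: the class name UnicodePipeline and the method decodeUpper are hypothetical, malformed elements are skipped with a regular expression instead of a try/catch, and the result is collected with StringBuilder.appendCodePoint, so it does not need Character.toString(int) from Java 11. The "+U" prefix of the test data above is kept.

```java
import java.util.Arrays;

public class UnicodePipeline {

    // Hypothetical variant: decodes "+Unnnn" elements and upper-cases the result.
    public static String decodeUpper(String data) {
        return Arrays.stream(data.split("\\+U"))
                .filter(s -> !s.isEmpty())                     // split() yields a leading empty string
                .filter(s -> s.matches("[0-9A-Fa-f]{1,7}"))    // at most 7 hex digits, so parseInt cannot overflow
                .mapToInt(s -> Integer.parseInt(s, 16))
                .filter(cp -> Character.isValidCodePoint(cp))  // drop values beyond U+10FFFF
                .map(cp -> Character.toUpperCase(cp))          // the extra step: case conversion
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUpper("+U0048+U0069+U0wxyz+U70FFFF+U1f601"));
        System.out.println(decodeUpper("+U00E4")); // a single code point works too
    }
}
```

Unlike the original, this variant drops malformed elements silently instead of printing a message for each one.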