
How do I convert Unicode code points to their character representation?

Tags: java, unicode

How do I convert strings representing code points to the appropriate character?

For example, I want to have a function which gets U+00E4 and returns ä.

I know that the Character class has a method toChars(int codePoint) that takes an integer, but there is no method that takes a string in this format.

Is there a built-in function, or do I have to transform the string myself to get the integer that I can pass to that method?

David Michael Gang asked Aug 22 '13 12:08




3 Answers

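Code points are written as hexadecimal numbers prefixed by U+.

So you can strip the prefix and parse the rest as base 16:

// Drop the leading "U+" and parse the remaining hex digits.
int codepoint = Integer.parseInt(yourString.substring(2), 16);
// toChars returns one char, or a surrogate pair for code points above U+FFFF.
char[] ch = Character.toChars(codepoint);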
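
A minimal sketch that wraps the two calls above in a reusable method. The class name CodePoints and the method name fromNotation are hypothetical, and so is the choice to throw IllegalArgumentException on malformed input. It assumes the input always carries the U+ prefix.

```java
class CodePoints {

    // Hypothetical helper: converts "U+00E4" style notation to the character it names.
    static String fromNotation(String notation) {
        if (notation == null || !notation.startsWith("U+")) {
            throw new IllegalArgumentException("Expected U+ followed by hex digits: " + notation);
        }
        // parseInt throws NumberFormatException (a subclass of IllegalArgumentException)
        // when the remainder is not a hexadecimal number.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Outside the Unicode range: " + notation);
        }
        // One char for code points up to U+FFFF, a surrogate pair above that.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromNotation("U+00E4"));
        System.out.println(fromNotation("U+1F601"));
    }
}
```

Returning a String rather than a char[] means the caller does not need to care whether the code point fits in a single char.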
Anirudha answered Sep 23 '22 15:09


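Use the String constructor that takes an array of code points, String(int[] codePoints, int offset, int count):

// Produces the same string as the literal "\u00E4".
String s = new String(new int[] { 0x00E4 }, 0, 1);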
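
As a rough illustration, the same constructor accepts several code points at once, including ones above U+FFFF that need two char values. The class and method names below are made up for the example.

```java
class CodePointArrayDemo {

    // Hypothetical helper: builds a string from every code point passed in.
    static String fromCodePoints(int... codePoints) {
        // The constructor throws IllegalArgumentException if any value is not a valid code point.
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x00E4, 0x1F601);
        System.out.println(s.length());                      // 3: one char for U+00E4, two for U+1F601
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
    }
}
```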
Joop Eggen answered Sep 24 '22 15:09


The question asked for a function to convert a string value representing a Unicode code point (i.e. "U+nnnn" rather than the Java formats of "\unnnn" or "0xnnnn"). However, newer releases of Java have enhancements which simplify the processing of a string containing multiple code points in Unicode notation:

  • The introduction of Streams in Java 8.
  • The method public static String toString(int codePoint), which was added to the Character class in Java 11. It returns a String rather than a char[], so Character.toString(0x00E4) returns "ä".

These enhancements allow a different approach to the problem raised in the question. The method below transforms a sequence of code points in Unicode notation into a readable String in a single statement:

import java.util.Arrays;
import java.util.stream.Collectors;

void processUnicode() {

    // Create a test string containing "Hello World 😁" with code points in U+nnnn notation.
    // Include an invalid code point (U+0wxyz), and a code point outside the Unicode range (U+70FFFF).
    String data = "U+0048U+0065U+006cU+006cU+0wxyzU+006fU+0020U+0057U+70FFFFU+006fU+0072U+006cU+0000064U+20U+1f601";

    String text = Arrays.stream(data.split("U\\+"))
            .filter(s -> !s.isEmpty()) // First element returned by split() is a zero length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) {
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\")");
                    return null; // The code point is not represented as a valid hex String.
                }
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11)
            .collect(Collectors.joining());

    System.out.println(text); // Prints "Hello World 😁"
}

And this is the output:

run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz")
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)

Notes:

  • With this approach there is no longer any need for a dedicated function that converts a single code point in Unicode notation. That work is instead spread across several intermediate operations in the stream pipeline. Of course, the same code can still be used to process just one code point.
  • It's easy to add intermediate operations that perform further validation and processing on the Stream, such as case conversion, removal of emoticons, etc.
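
To illustrate the last note, here is a sketch of a variant with one extra intermediate step for case conversion. It is not part of the original answer: the class and method names are hypothetical, it assumes input in the question's U+nnnn notation, and it swaps the try/catch for a regular-expression filter. Because it collects with StringBuilder.appendCodePoint instead of Character.toString(int), it should also compile on Java 8.

```java
import java.util.Arrays;

class UpperCaseCodePoints {

    // Hypothetical variant: decodes "U+0068U+0069" style input and upper-cases each code point.
    static String decodeUpperCase(String data) {
        return Arrays.stream(data.split("U\\+"))
                .filter(s -> !s.isEmpty())                   // split() yields a leading empty string
                .filter(s -> s.matches("[0-9A-Fa-f]{1,7}"))  // hex only; at most 7 digits so parseInt cannot overflow
                .mapToInt(s -> Integer.parseInt(s, 16))
                .filter(Character::isValidCodePoint)         // drop values above U+10FFFF
                .map(Character::toUpperCase)                 // the added step: case conversion per code point
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUpperCase("U+0068U+0069U+0wxyzU+00E4U+70FFFFU+1f601"));
    }
}
```

Unlike the version above, malformed elements are dropped silently here. Keep the try/catch form if the log line matters.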
skomisa answered Sep 22 '22 15:09