How do I convert strings representing code points to the appropriate character? For example, I want a function that takes U+00E4 and returns ä. I know that the Character class has a method toChars(int codePoint) that takes an integer, but there is no method that takes a string of this form. Is there a built-in function, or do I have to transform the string into an integer that I can pass to that method?


How do I convert unicode codepoints to their character representation?

3 Answers

Code points are written as hexadecimal numbers prefixed by U+.

So, you can do this:

int codepoint = Integer.parseInt(yourString.substring(2), 16); // skip the "U+" prefix, parse the rest as hex
char[] ch = Character.toChars(codepoint);                      // one char, or a surrogate pair above U+FFFF
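A minimal sketch of wrapping the two calls above into the kind of function the question asks for. The class name CodePointParser, the method name fromNotation, and the choice to reject malformed input with IllegalArgumentException are assumptions made for illustration, not part of the original answer.

```java
final class CodePointParser {

    /**
     * Converts notation such as "U+00E4" to the string it denotes.
     * Hypothetical helper: the name and the validation rules are assumptions.
     */
    static String fromNotation(String notation) {
        // Accept "U+" or "u+"; anything else (including null or a too-short string) is rejected.
        if (notation == null || !notation.regionMatches(true, 0, "U+", 0, 2)) {
            throw new IllegalArgumentException("Expected U+ followed by hex digits: " + notation);
        }
        // parseInt throws NumberFormatException (a subclass of IllegalArgumentException) for non-hex input.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Outside the Unicode range: " + notation);
        }
        // toChars yields one char for BMP code points and a surrogate pair otherwise.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromNotation("U+00E4"));
        System.out.println(fromNotation("U+1F601"));
    }
}
```

Returning a String rather than a char[] keeps the caller from having to deal with surrogate pairs directly; whether the characters display correctly depends on the console encoding.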

answered Sep 23 '22 15:09

Anirudha

Call this constructor on String.

"\u00E4"                                  // string literal with a Unicode escape
new String(new int[] { 0x00E4 }, 0, 1);   // String(int[] codePoints, int offset, int count)
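As a small illustration of that constructor, the sketch below builds strings from a BMP code point and from a supplementary one. The class and method names are hypothetical; only the String(int[], int, int) constructor comes from the answer.

```java
final class CodePointConstructorDemo {

    /** Builds a String from one code point via String(int[] codePoints, int offset, int count). */
    static String fromCodePoint(int codePoint) {
        // The constructor throws IllegalArgumentException for an invalid code point.
        return new String(new int[] { codePoint }, 0, 1);
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x00E4);            // same content as the literal "\u00E4"
        String supplementary = fromCodePoint(0x1F601); // stored as a surrogate pair

        System.out.println(bmp.equals("\u00E4"));                                    // true
        System.out.println(supplementary.length());                                  // 2 (char units)
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1 (code point)
    }
}
```

The last two lines show why length() is not a reliable character count once code points above U+FFFF are involved.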

answered Sep 24 '22 15:09

Joop Eggen

The question asked for a function to convert a string value representing a Unicode code point (written "U+nnnn" in the question and "+Unnnn" in the sample data below, rather than the Java formats "\unnnn" or "0xnnnn"). However, newer releases of Java have enhancements which simplify the processing of a string containing multiple code points in Unicode format:

The introduction of Streams in Java 8.
Method public static String toString(int codePoint) which was added to the Character class in Java 11. It returns a String rather than a char[], so Character.toString(0x00E4) returns "ä".

Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a set of code points in Unicode format to a readable String in a single statement:

import java.util.Arrays;
import java.util.stream.Collectors;

void processUnicode() {

    // Create a test string containing "Hello World 😁" with code points in Unicode format.
    // Include an invalid code point (+U0wxyz), and a code point outside the Unicode range (+U70FFFF).
    String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";

    String text = Arrays.stream(data.split("\\+U"))
            .filter(s -> ! s.isEmpty()) // First element returned by split() is a zero length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) { 
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\"}");
                }
                return null; // If the code point is not represented as a valid hex String.
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11 )
            .collect(Collectors.joining());

    System.out.println(text); // Prints "Hello World 😁"
}

And this is the output:

run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz"}
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)

Notes:

With this approach there is no longer any need for a dedicated function to convert a code point in Unicode format. Instead, that work is spread across several intermediate operations in the Stream pipeline. Of course, the same code can still be used to process just a single code point in Unicode format.
It's easy to add intermediate operations to perform further validation and processing on the Stream, such as case conversion, removal of emoticons, etc.
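Since Character.toString(int) requires Java 11, here is a hedged sketch of a variant of the pipeline above that should also compile on Java 8, using StringBuilder.appendCodePoint instead. The class name CodePointStreams, the method name decode, and the decision to drop bad elements silently (rather than log them) are assumptions for illustration.

```java
import java.util.Arrays;

final class CodePointStreams {

    /**
     * Decodes data such as "+U0048+U0069" without Character.toString(int).
     * Hypothetical helper; unlike the method above, it drops bad elements silently.
     */
    static String decode(String data) {
        return Arrays.stream(data.split("\\+U"))
                // Keep 1 to 7 hex digits: drops the empty first element and non-hex text,
                // and guarantees that parseInt cannot overflow an int.
                .filter(s -> s.matches("[0-9A-Fa-f]{1,7}"))
                .mapToInt(s -> Integer.parseInt(s, 16))
                .filter(Character::isValidCodePoint) // ignore values outside the Unicode range
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // The same call handles a single code point or a whole sequence.
        System.out.println(decode("+U00E4"));
        System.out.println(decode("+U0048+U0065+U006c+U006c+U0wxyz+U006f+U20+U1f601"));
    }
}
```

Working on an IntStream avoids boxing each code point and removes the need for the null-returning map step, at the cost of losing the per-element diagnostic message.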

answered Sep 22 '22 15:09

skomisa


Tags:

java

unicode

David Michael Gang
