How to parse UTF-8 representation to String in Java?

Given the following code:

String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");

String result = convertToEffectiveString(tmp); // result now contains "hello\n"

Does the JDK already provide a class for doing this? Is there a library that does it (preferably available through Maven)?

I have tried with ByteArrayOutputStream with no success.

Is Java a UTF-8 string?

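Java String objects are stored as sequences of UTF-16 code units. The Java platform is also required to support other character encodings (charsets), including US-ASCII, ISO-8859-1, and UTF-8. Errors can occur when converting between differently encoded character data. They fall into two general types: malformed input, where the byte sequence is not valid for the charset, and unmappable characters, where a character has no representation in the target charset.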
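As a rough illustration of how a decoding error can be detected rather than silently replaced, here is a minimal sketch using the JDK's CharsetDecoder. The class name StrictUtf8 and the method isValidUtf8 are made up for this example; only the java.nio.charset calls are real JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class StrictUtf8 {

    // Returns true if the bytes decode cleanly as UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Raised for malformed input or unmappable characters
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {0x3f, (byte) 0xff}; // 0xFF never occurs in valid UTF-8
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

By contrast, new String(bytes, charset) does not throw on bad input; it substitutes the replacement character U+FFFD, which is why such errors are easy to miss.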

How do you parse a string in Java?

String parsing in Java is usually done with String.split, which takes a delimiter (interpreted as a regular expression) and returns the pieces as an array. split is a method of String itself, not of a wrapper class; the wrapper classes, such as Integer, are what convert the resulting pieces into numbers, for example with Integer.parseInt. Strings can also be tokenized with the older StringTokenizer class.
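A small sketch of both approaches, assuming comma-separated or space-separated input. The class ParseDemo and its method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the JDK.
final class ParseDemo {

    // Splits a comma-separated list and parses each field as an int.
    static int[] parseInts(String csv) {
        String[] fields = csv.split(",");
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i].trim());
        }
        return values;
    }

    // Tokenizes with the legacy StringTokenizer, which skips empty tokens.
    static List<String> tokens(String text, String delimiters) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(parseInts("1, 2,3")));
        System.out.println(tokens("a b  c", " "));
    }
}
```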

What is StandardCharsets UTF_8 in Java?

When working with strings in Java, we often need to encode them in a specific charset, such as UTF-8. UTF-8 is a variable-width character encoding that uses between one and four eight-bit bytes to represent every valid Unicode code point. StandardCharsets.UTF_8, in the java.nio.charset package, is the predefined Charset constant for it.

How do I convert to UTF-8 in Java?

To encode a String as UTF-8, use the getBytes() method. getBytes() encodes a String into a sequence of bytes and returns a byte array. It has the overloads getBytes(String charsetName) and getBytes(Charset charset), where the argument names the charset used to encode the String; for UTF-8, pass "UTF-8" or StandardCharsets.UTF_8.
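To make the variable width concrete, here is a short sketch that encodes a few sample strings and prints how many bytes each takes. The class Utf8RoundTrip and its helpers are hypothetical names; getBytes(Charset) and the String(byte[], Charset) constructor are standard JDK calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
final class Utf8RoundTrip {

    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+0068, U+00E9, U+20AC, and U+1F600 (written as a surrogate pair)
        String[] samples = {"h", "\u00e9", "\u20ac", "\ud83d\ude00"};
        for (String s : samples) {
            System.out.println(s.length() + " char(s) -> " + encode(s).length + " byte(s)");
        }
    }
}
```

The Charset overloads are usually preferable to the String-name ones because they cannot throw UnsupportedEncodingException.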

This works, but only with ASCII. If you use Unicode characters outside the ASCII range you will have problems, because each code is stuffed into a single byte, whereas UTF-8 needs two or more bytes for anything beyond ASCII. You can do the typecast below because UTF-8 never needs more than one byte per character if you guarantee that the input is basically ASCII (as you mention in your comments).

package sample;

import java.nio.charset.StandardCharsets;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";

        // Replace each backslash-u prefix with a space, then split into hex codes
        String[] arr = str.replaceAll("\\\\u", " ").trim().split(" ");
        byte[] utf8 = new byte[arr.length];

        int index = 0;
        for (String ch : arr) {
            // Safe only while every code is below 0x80 (plain ASCII)
            utf8[index++] = (byte) Integer.parseInt(ch, HEXADECIMAL);
        }

        // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
        String newStr = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(newStr);
    }
}
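To see why the one-byte-per-code approach above breaks outside ASCII, here is a small sketch that mimics it for arbitrary codes. ByteCastPitfall and decodeOneBytePerCode are names invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
final class ByteCastPitfall {

    // One byte per code, then decode the result as UTF-8.
    static String decodeOneBytePerCode(int... codes) {
        byte[] bytes = new byte[codes.length];
        for (int i = 0; i < codes.length; i++) {
            bytes[i] = (byte) codes[i]; // truncates anything above 0xFF
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASCII codes survive the round trip
        System.out.println(decodeOneBytePerCode(0x68, 0x69));
        // 0xE9 alone is not a complete UTF-8 sequence, so the decoder
        // substitutes the replacement character U+FFFD
        System.out.println(decodeOneBytePerCode(0xE9).indexOf('\uFFFD') >= 0);
    }
}
```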

Here is another solution that fixes the issue of only working with ASCII characters. Each \uXXXX escape is a UTF-16 code unit, which is exactly what a Java char holds, so the codes can be appended to a StringBuilder directly instead of going through a byte array. This works for any Unicode character, including those written as a surrogate pair of two escapes. Thanks to deceze for the questions. You made me think more about the problem and solution.

package sample;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";

        // Replace each backslash-u prefix with a space, then split into hex codes
        String[] codes = str.replaceAll("\\\\u", " ").trim().split(" ");

        StringBuilder sb = new StringBuilder();
        for (String c : codes) {
            // Each four-digit code is one UTF-16 code unit, i.e. one Java char
            sb.append((char) Integer.parseInt(c, HEXADECIMAL));
        }

        str = sb.toString();
        System.out.println(str);
    }
}

Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?

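If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes. The one exception is the line feed, which has to be written as \n: the compiler translates Unicode escapes before it parses the source, so \u000a inside a literal becomes a real line break and the code no longer compiles.

String result = "\u0068\u0065\u006c\u006c\u006f\n";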

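If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method. In recent Commons Lang releases that class is deprecated in favour of the StringEscapeUtils in Apache Commons Text.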
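If pulling in a library is not an option, the \uXXXX part can be handled with the JDK alone. The following is only a minimal sketch, not a full implementation of Java's escape rules: UnicodeUnescape and unescape are hypothetical names, it ignores other escapes such as \n or \t, and it does not treat a doubled backslash before the u specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or Commons Lang.
final class UnicodeUnescape {

    // A backslash, one or more 'u', then exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between escapes unchanged
            out.append(input, last, m.start());
            // Each escape is one UTF-16 code unit, so a char cast is enough
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.print(unescape(tmp));
    }
}
```

Because the escapes are appended as UTF-16 code units, a character outside the Basic Multilingual Plane comes out correctly as long as the input spells it as a surrogate pair of two escapes.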

Tags: java, ascii, utf-8

Stephan

2 Answers

jmq

prunge
