Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java String encoding (UTF-8)

I have come across this line of legacy code, which I am trying to figure out:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

As far as I can understand, it is encoding & decoding using the same charSet.

How is this different from the following?

String newString = oldString;

Is there any scenario in which the two lines will have different outputs?

p.s.: Just to clarify, yes I am aware of the excellent article on encoding by Joel Spolsky !

like image 232
OceanBlue Avatar asked Jan 13 '12 16:01

OceanBlue


People also ask

Are Java strings UTF-8?

String objects in Java are encoded in UTF-16. Java Platform is required to support other character encodings or charsets such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently coded character data. There are two general types of encoding errors.

How do I encode a String in UTF-8?

In order to convert a String into UTF-8, we use the getBytes() method in Java. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. where charsetName is the specific charset by which the String is encoded into an array of bytes.

Is Java UTF-8 or 16?

The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.


2 Answers

This could be complicated way of doing

String newString = new String(oldString);

This shortens the String is the underlying char[] used is much longer.

However more specifically it will be checking that every character can be UTF-8 encoded.

There are some "characters" you can have in a String which cannot be encoded and these would be turned into ?

Any character between \uD800 and \uDFFF cannot be encoded and will be turned into '?'

String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));

prints

false
like image 61
Peter Lawrey Avatar answered Sep 17 '22 22:09

Peter Lawrey


How is this different from the following?

This line of code here:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

constructs a new String object (i.e. a copy of oldString), while this line of code:

String newString = oldString;

declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString.

Is there any scenario in which the two lines will have different outputs?

Absolutely:

String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true

vs.

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));
 // isSameInstance == false (in most cases)    
boolean isSameInstance = newString == oldString;

a_horse_with_no_name (see comment) is right of course. The equivalent of

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

is

String newString = new String(oldString);

minus the subtle difference wrt the encoding that Peter Lawrey explains in his answer.

like image 32
afrischke Avatar answered Sep 16 '22 22:09

afrischke