I have come across this line of legacy code, which I am trying to figure out:
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
As far as I can tell, it is encoding and then decoding using the same charset.
How is this different from the following?
String newString = oldString;
Is there any scenario in which the two lines will have different outputs?
P.S. Just to clarify: yes, I am aware of the excellent article on encoding by Joel Spolsky!
String objects in Java are encoded in UTF-16. The Java platform is required to support other character encodings (charsets) such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently coded character data. There are two general types of encoding errors.
In order to convert a String into UTF-8, we use the getBytes() method in Java. The getBytes(String charsetName) method encodes a String into a sequence of bytes and returns a byte array, where charsetName is the name of the charset used to encode the String into an array of bytes.
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
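To make the mapping between chars and bytes concrete, here is a minimal sketch that encodes the same String with two charsets and compares the byte counts. The class and method names are made up for illustration, and it uses StandardCharsets (Java 7+) instead of the "UTF-8" name string from the question, which avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingLengths {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (big-endian, no byte-order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "héllo": five chars, one outside ASCII
        System.out.println(text.length());     // 5 UTF-16 code units
        System.out.println(utf8Length(text));  // 6: the accented letter takes two bytes
        System.out.println(utf16Length(text)); // 10: two bytes per code unit
    }
}
```

The String itself never changes; only the byte representation differs from charset to charset.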
This could be a complicated way of doing
String newString = new String(oldString);
This shortens the String if the underlying char[] used is much longer.
However, more specifically, it will be checking that every character can be UTF-8 encoded.
There are some "characters" you can have in a String which cannot be encoded, and these would be turned into '?'.
Any unpaired surrogate, i.e. a char between \uD800 and \uDFFF that is not part of a valid surrogate pair, cannot be encoded and will be turned into '?'
String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));
prints
false
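As a slightly fuller sketch of the same point, the following compares a lone surrogate with a valid surrogate pair and with plain ASCII text. The class and method names are hypothetical, and StandardCharsets is used in place of the "UTF-8" name string purely for convenience.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class RoundTripDemo {
    // Encode to UTF-8 bytes and decode straight back, as in the legacy line.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String lone = "\uD800";       // unpaired high surrogate
        String pair = "\uD83D\uDE00"; // valid surrogate pair (U+1F600)
        String plain = "hello";

        System.out.println(roundTrip(lone).equals(lone));   // false
        System.out.println(roundTrip(pair).equals(pair));   // true
        System.out.println(roundTrip(plain).equals(plain)); // true
    }
}
```

So only malformed UTF-16 content is altered by the round trip; well-formed text, including characters outside the Basic Multilingual Plane, comes back unchanged.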
How is this different from the following?
This line of code here:
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
constructs a new String object (i.e. a copy of oldString), while this line of code:
String newString = oldString;
declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString.
Is there any scenario in which the two lines will have different outputs?
Absolutely:
String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true
vs.
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
// isSameInstance == false (new always creates a distinct object)
boolean isSameInstance = newString == oldString;
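Put together as a runnable sketch (the class and method names are invented for illustration, and StandardCharsets replaces the "UTF-8" name string), the difference between reference identity and content equality looks like this:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class IdentityDemo {
    // Same encode/decode round trip as the legacy line.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String oldString = "hello";
        String alias = oldString;           // second reference to the same object
        String copy = roundTrip(oldString); // freshly constructed String

        System.out.println(alias == oldString);     // true: same instance
        System.out.println(copy == oldString);      // false: different instance
        System.out.println(copy.equals(oldString)); // true: same contents
    }
}
```

Code that compares with equals() will not notice the copy; only code that relies on == (or on identity-based structures) would see a difference.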
a_horse_with_no_name (see comment) is right of course. The equivalent of
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
is
String newString = new String(oldString);
minus the subtle difference with respect to the encoding that Peter Lawrey explains in his answer.