I have come across this line of legacy code, which I am trying to figure out:
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));
As far as I can understand, it is encoding & decoding using the same charSet.
How is this different from the following?
String newString = oldString;
Is there any scenario in which the two lines will have different outputs?
p.s.: Just to clarify, yes I am aware of the excellent article on encoding by Joel Spolsky !
String objects in Java are encoded in UTF-16. Java Platform is required to support other character encodings or charsets such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently coded character data. There are two general types of encoding errors.
In order to convert a String into UTF-8, we use the getBytes() method in Java. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. where charsetName is the specific charset by which the String is encoded into an array of bytes.
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
This could be complicated way of doing
String newString = new String(oldString);
This shortens the String is the underlying char[] used is much longer.
However more specifically it will be checking that every character can be UTF-8 encoded.
There are some "characters" you can have in a String which cannot be encoded and these would be turned into ?
Any character between \uD800 and \uDFFF cannot be encoded and will be turned into '?'
String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));
prints
false
                        How is this different from the following?
This line of code here:
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));
constructs a new String object (i.e. a copy of oldString), while this line of code:
String newString = oldString;
declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString.
Is there any scenario in which the two lines will have different outputs?
Absolutely:
String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true
vs.
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));
 // isSameInstance == false (in most cases)    
boolean isSameInstance = newString == oldString;
a_horse_with_no_name (see comment) is right of course. The equivalent of
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));
is
String newString = new String(oldString);
minus the subtle difference wrt the encoding that Peter Lawrey explains in his answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With