Some pseudocode:
String a = "A bunch of text"; //UTF-16
saveTextInDb(a); //Write to Oracle VARCHAR(15) column
String b = readTextFromDb(); //UTF-16
out.write(b); //Write to http response
When you save the Java String (UTF-16) to an Oracle VARCHAR(15) column, does Oracle also store it as UTF-16? Does the length of an Oracle VARCHAR refer to the number of Unicode characters (and not the number of bytes)?
When we write b to the ServletResponse, is it written as UTF-16, or are we by default converting to another encoding like UTF-8?
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
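To make the "sixteen-bit code units" point concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows that String.length() counts UTF-16 code units, not Unicode characters, so a character outside the Basic Multilingual Plane takes two chars.

```java
class CodeUnitsDemo {
    // Number of 16-bit UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u00e9";          // U+00E9 (e with acute), a single code unit
        String astral = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));       // prints "1 1"
        System.out.println(codeUnits(astral) + " " + codePoints(astral)); // prints "2 1"
    }
}
```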
In computers, encoding is the process of putting a sequence of characters (letters, numbers, punctuation, and certain symbols) into a specialized format for efficient transmission or storage. Decoding is the opposite process -- the conversion of an encoded format back into the original sequence of characters.
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
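A small sketch of the difference between the two encodings (class and method names are illustrative only). ASCII text produces identical bytes in both, an accented Latin-1 character needs two bytes in UTF-8, and a character outside Latin-1 cannot be represented there at all, so String.getBytes substitutes the charset's replacement byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingSizes {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] latin1(String s) {
        // Characters that ISO 8859-1 cannot represent are replaced with '?'.
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "caf\u00e9"; // "cafe" with an acute accent on the e
        String euro = "\u20ac";        // euro sign, not part of ISO 8859-1

        // ASCII is encoded identically by both charsets.
        System.out.println(Arrays.equals(utf8(ascii), latin1(ascii))); // prints "true"

        // U+00E9 is one byte in ISO 8859-1 but two bytes in UTF-8.
        System.out.println(utf8(accented).length + " vs " + latin1(accented).length); // prints "5 vs 4"

        // The euro sign takes three bytes in UTF-8; in ISO 8859-1 it degrades to '?'.
        System.out.println(utf8(euro).length + " vs " + (char) latin1(euro)[0]); // prints "3 vs ?"
    }
}
```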
Answer: The most common ones are Windows-1252 and Latin-1 (ISO-8859-1).
Instead of UTF-16, think of the "internal representation" of your string. A String in Java is a sequence of characters, and you normally don't need to care which encoding is used internally. Encoding becomes relevant as soon as you interact with the outside of the program. In your example, saveTextInDb, readTextFromDb and write do that. Every time you exchange strings with the outside world, some encoding is used for the conversion. saveTextInDb (and readTextFromDb) look like self-made methods, at least I don't know them, so you should look up which encoding these methods use. The write method of a Writer always produces bytes in the encoding associated with that writer. If you get your Writer from an HttpServletResponse, the associated encoding is the one used for outputting the response (which is also sent in the headers).
response.setCharacterEncoding("UTF-8"); // must be called before getWriter()
Writer out = response.getWriter();
Here out is a Writer that encodes the strings you write as UTF-8. The same applies when you write to a file:
Writer fileout = new OutputStreamWriter(new FileOutputStream(myfile), "ISO-8859-1");
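The way a Writer turns characters into bytes can be checked without a servlet container or a file. The sketch below is only an illustration: it uses an in-memory ByteArrayOutputStream in place of the response or file stream, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Writes text through a Writer bound to the given charset and returns the bytes produced.
    static byte[] writeWith(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for a file or response stream
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // one Java char, U+00E9

        // The same String yields different bytes depending on the writer's encoding.
        System.out.println(writeWith(text, StandardCharsets.UTF_8).length);      // prints "2"
        System.out.println(writeWith(text, StandardCharsets.ISO_8859_1).length); // prints "1"
    }
}
```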
If you access a DB, the framework you use should ensure a consistent exchange of strings with the database.
Oracle's ability to store (and later retrieve) Unicode text depends only on the character set of the database (usually specified when the database is created). AL32UTF8 is the recommended character set for storing Unicode text in CHAR datatypes (including VARCHAR/VARCHAR2): it covers all Unicode code points without consuming as much storage as wider encodings such as AL16UTF16.
Assuming this is done, the Oracle JDBC driver is responsible for converting the UTF-16 encoded data into AL32UTF8. The same "automatic" conversion happens in the other direction when data is read from the database. As for the length of a VARCHAR: a VARCHAR2 column uses byte semantics by default, so VARCHAR2(n) defines a column that can store n bytes (the default is controlled by the NLS_LENGTH_SEMANTICS parameter of the database). To size the column in characters, use VARCHAR2(n CHAR).
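To see why byte semantics matters for the VARCHAR(15) column in the question, here is a rough sketch. It assumes an AL32UTF8 database and treats the stored size as the length of the Java UTF-8 encoding; the class and method names are hypothetical, and this is not a substitute for the database's own length check.

```java
import java.nio.charset.StandardCharsets;

class VarcharFit {
    // Would the value fit a VARCHAR2(n BYTE) column? Assumes the database character set is AL32UTF8.
    static boolean fitsByteSemantics(String s, int n) {
        return s.getBytes(StandardCharsets.UTF_8).length <= n;
    }

    // Would the value fit a VARCHAR2(n CHAR) column? Assumes one character per Unicode code point.
    static boolean fitsCharSemantics(String s, int n) {
        return s.codePointCount(0, s.length()) <= n;
    }

    public static void main(String[] args) {
        String plain = "A bunch of text";         // 15 ASCII characters, 15 bytes in UTF-8
        String accented = "A b\u00fcnch of text"; // 15 characters, but 16 bytes in UTF-8

        System.out.println(fitsByteSemantics(plain, 15));    // prints "true"
        System.out.println(fitsByteSemantics(accented, 15)); // prints "false"
        System.out.println(fitsCharSemantics(accented, 15)); // prints "true"
    }
}
```

With byte semantics, a single non-ASCII character is enough to push a 15-character string over a 15-byte limit.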
The encoding of the data written to the ServletResponse object depends on the default character encoding (ISO-8859-1 according to the Servlet specification, unless the container is configured otherwise), unless it is specified via the ServletResponse.setCharacterEncoding() or ServletResponse.setContentType() API calls. All in all, for a complete Unicode solution involving an Oracle database, one must have knowledge of:
1. The encoding of the incoming request data. It is needed when one invokes ServletRequest.getParameter or similar methods that process the stream and return String objects. The decoding process creates characters in the platform encoding (this is UTF-16).
2. The encoding of the data read from streams, as opposed to data created within the JVM. This is quite important, since the encoding of data read from streams cannot be changed. There is, however, a decoding process that converts characters in supported encodings to UTF-16 characters whenever such data is accessed as a character primitive or as a String. New String objects, on the other hand, can be created with a defined encoding. This matters when you write the contents of the stream out onto another stream (the HttpServletResponse object's output stream, for instance). If the contents of the input stream are treated as a sequence of bytes, and not as characters or Strings, then no decoding operation is undertaken by the JVM. The bytes written to the output stream are therefore unaltered as long as no intermediate character or String objects are created. Otherwise, it is quite possible that the contents of the output stream will be malformed and parsed incorrectly by a corresponding decoder.
3. The encoding of the data exchanged with the database. In simpler words, when you invoke resultSet.getString(), a String with UTF-16 characters is returned by the JDBC driver. The converse is true when you send data to the database. If another database character set is used, an additional level of conversion (from UTF-16 to UTF-8 to the database character set) is performed transparently by the JDBC driver.
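The point about decoding stream data can be illustrated with a short sketch (names are hypothetical). Bytes that pass through untouched stay intact, while bytes that are decoded into a String with the wrong charset and then re-encoded come out corrupted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeDemo {
    // Decodes raw bytes under an assumed charset, then re-encodes the resulting String in the target charset.
    static byte[] transcode(byte[] raw, Charset assumed, Charset target) {
        String decoded = new String(raw, assumed); // bytes -> UTF-16 chars
        return decoded.getBytes(target);           // UTF-16 chars -> bytes
    }

    public static void main(String[] args) {
        byte[] incoming = "\u00e9".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9

        // Correct assumption about the source encoding: the round trip preserves the bytes.
        byte[] ok = transcode(incoming, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(incoming, ok)); // prints "true"

        // Wrong assumption: each byte is decoded as its own Latin-1 character,
        // so re-encoding as UTF-8 doubles the length and the text is garbled.
        byte[] broken = transcode(incoming, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(broken.length); // prints "4"
    }
}
```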