 

Understanding character encoding in typical Java web app

Some pseudocode:

String a = "A bunch of text"; //UTF-16
saveTextInDb(a); //Write to Oracle VARCHAR(15) column
String b = readTextFromDb(); //UTF-16
out.write(b); //Write to http response

When you save the Java String (UTF-16) to an Oracle VARCHAR(15) column, does Oracle also store it as UTF-16? Does the length of an Oracle VARCHAR refer to the number of Unicode characters (and not the number of bytes)?

When we write b to the ServletResponse, is it written as UTF-16, or is it converted by default to another encoding such as UTF-8?

asked Mar 28 '10 by Marcus Leon

People also ask

What character encoding is used in Java?

The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
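A small illustration of the 16-bit code unit view, using the musical G clef character U+1D11E, which does not fit in a single char:

public class Utf16Units {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";  // U+1D11E, stored as a surrogate pair
        System.out.println(clef.length());                          // 2 - UTF-16 code units (chars)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 - Unicode characters
    }
}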

What is encoding in web application?

In computers, encoding is the process of putting a sequence of characters (letters, numbers, punctuation, and certain symbols) into a specialized format for efficient transmission or storage. Decoding is the opposite process -- the conversion of an encoded format back into the original sequence of characters.

What is the difference between ISO-8859-1 and UTF-8?

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
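A quick way to see the difference (and the shared ASCII range) is to compare the raw bytes produced for the same text under each encoding:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingComparison {
    public static void main(String[] args) {
        String accented = "é"; // U+00E9, outside ASCII
        // ISO-8859-1: one byte per character -> [-23] (0xE9)
        System.out.println(Arrays.toString(accented.getBytes(StandardCharsets.ISO_8859_1)));
        // UTF-8: two bytes for this character -> [-61, -87] (0xC3 0xA9)
        System.out.println(Arrays.toString(accented.getBytes(StandardCharsets.UTF_8)));
        // Plain ASCII text is byte-for-byte identical in both encodings -> true
        System.out.println(Arrays.equals("abc".getBytes(StandardCharsets.ISO_8859_1),
                                         "abc".getBytes(StandardCharsets.UTF_8)));
    }
}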

What are the two most popular character encodings in Java?

The most commonly encountered ones are Windows-1252 and Latin-1 (ISO-8859-1).


2 Answers

Instead of UTF-16, think of it as the 'internal representation' of your string. A String in Java is a sequence of characters; you don't care which encoding is used internally. Encoding becomes relevant when you interact with the outside of the program, which is exactly what saveTextInDb, readTextFromDb and write do in your example. Every time you exchange strings with the outside world, an encoding is used for the conversion. saveTextInDb (and readTextFromDb) look like self-made methods, so you should look up which encoding they use. The write method of a Writer always produces bytes in the encoding associated with that Writer. If you get your Writer from a HttpServletResponse, the associated encoding is the one used for the response body (and sent in the response headers).

response.setCharacterEncoding("UTF-8");
Writer out = response.getWriter();

This code returns a Writer (out) that encodes strings as UTF-8. The same applies if you write to a file:

Writer fileout = new OutputStreamWriter(new FileOutputStream(myfile), "ISO-8859-1");

If you access a DB, the framework you use should ensure a consistent exchange of strings with the database.
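As a rough illustration (assuming plain JDBC and a hypothetical table text_store with a VARCHAR2 column named content), the driver takes care of the character conversion in both directions:

import java.sql.*;

public class TextDao {

    // Hypothetical helper: JDBC converts the Java (UTF-16) String
    // to the database character set when binding the parameter.
    static void saveTextInDb(Connection con, String text) throws SQLException {
        try (PreparedStatement ps =
                 con.prepareStatement("INSERT INTO text_store (content) VALUES (?)")) {
            ps.setString(1, text);
            ps.executeUpdate();
        }
    }

    // Hypothetical helper: the driver decodes the column value
    // back into a Java String (UTF-16) on the way out.
    static String readTextFromDb(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT content FROM text_store")) {
            return rs.next() ? rs.getString(1) : null;
        }
    }
}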

answered Sep 23 '22 by Dishayloo


The ability of Oracle to store (and later retrieve) Unicode text in the database relies only on the character set of the database (usually specified during database creation). Choosing AL32UTF8 as the character set is recommended for storing Unicode text in CHAR datatypes (including VARCHAR/VARCHAR2), since it gives you access to all of the Unicode codepoints without consuming as much storage space as an encoding like AL16UTF16.
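If you are not sure which character set your database uses, you can query it. This is a minimal sketch that assumes you already have a JDBC Connection to the instance:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class NlsCheck {

    // Returns the database character set, e.g. "AL32UTF8".
    static String databaseCharacterSet(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT value FROM nls_database_parameters"
                 + " WHERE parameter = 'NLS_CHARACTERSET'")) {
            return rs.next() ? rs.getString(1) : null;
        }
    }
}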

Assuming this is done, it is the Oracle JDBC driver that is responsible for converting the UTF-16 encoded data into AL32UTF8. This "automatic" conversion between encodings also happens when data is read from the database. To answer the question about the byte length of a VARCHAR: the definition of a VARCHAR2 column in Oracle uses byte semantics by default - VARCHAR2(n) defines a column that can store n bytes (this default is controlled by the NLS_LENGTH_SEMANTICS parameter of the database); if you need to define the size in characters, use VARCHAR2(n CHAR).
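A minimal sketch of the difference, using two hypothetical tables and assuming AL32UTF8 as the database character set:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class LengthSemanticsDemo {

    static void createTables(Connection con) throws SQLException {
        try (Statement st = con.createStatement()) {
            // Byte semantics (the usual default): at most 15 *bytes*.
            // In AL32UTF8 a single character may need up to 4 bytes,
            // so fewer than 15 non-ASCII characters can fit.
            st.execute("CREATE TABLE t_bytes (txt VARCHAR2(15 BYTE))");

            // Character semantics: at most 15 *characters*, regardless of
            // how many bytes each one occupies in the database character set.
            st.execute("CREATE TABLE t_chars (txt VARCHAR2(15 CHAR))");
        }
    }
}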

The encoding of the data written to the ServletResponse object depends on the default character encoding, unless it is specified via the ServletResponse.setCharacterEncoding() or ServletResponse.setContentType() API calls (a short servlet sketch follows the list below). All in all, for a complete Unicode solution involving an Oracle database, one must have knowledge of

  1. The encoding of the incoming data (i.e. the encoding of the data read via the ServletRequest object). This can be done by specifying the accepted encoding in HTML forms via the accept-charset attribute. If the encoding is unknown, the application could attempt to set it to a known value via the ServletRequest.setCharacterEncoding() method. This method doesn't change the existing encoding of the characters in the stream; if the input stream is in ISO-Latin1, specifying a different encoding will most likely result in an exception being thrown. Knowing the encoding is important, since the Java runtime libraries require knowledge of the original encoding of the stream if its contents are to be treated as character primitives or Strings. This is required when you invoke ServletRequest.getParameter or similar methods that process the stream and return String objects. The decoding process results in characters in Java's internal representation (UTF-16).
  2. The encoding of the data read from streams, as opposed to data created within the JVM. This is quite important, since the encoding of data read from streams cannot be changed. There is, however, a decoding process that converts characters in supported encodings into UTF-16 characters whenever such data is accessed as a character primitive or as a String. New String objects, on the other hand, can be created with a defined encoding. This matters when you write the contents of the stream out onto another stream (the HttpServletResponse object's output stream, for instance). If the contents of the input stream are treated as a sequence of bytes, and not as characters or Strings, then no decoding operation is undertaken by the JVM. This implies that the bytes written to the output stream must not be altered if no intermediate character or String objects are created; otherwise, it is quite possible that the contents of the output stream will be malformed and parsed incorrectly by the corresponding decoder. In simpler words,

    • if one is writing String objects or characters to the servlet's output stream, then one must specify the encoding that the browser should use to decode the response; the container will use the corresponding encoder to turn the character sequence into bytes for the response.
    • if one is writing a sequence of bytes that will be interpreted as characters, then the encoding to be specified in the HTTP header must be known beforehand.
    • if one is writing a sequence of bytes that will be parsed as a sequence of bytes (for images and other binary data), then the concept of encoding is immaterial.
  3. The database character set of the Oracle instance. As indicated previously, data will be stored in the Oracle database in the defined character set (for CHAR datatypes). The Oracle JDBC driver takes care of the conversion of data between UTF-16 and AL32UTF8 (the database character set in this case) for CHAR and NCHAR datatypes. When you invoke resultSet.getString(), a String with UTF-16 characters is returned by the JDBC driver; the converse happens when you send data to the database. If another database character set is used, an additional level of conversion (from UTF-16 to UTF-8 to the database character set) is performed transparently by the JDBC driver.
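To make the list concrete, here is a minimal servlet sketch (the servlet and the parameter name text are made up for illustration) that fixes the request and response encodings explicitly:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class EchoServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // 1. Tell the container how the request body is encoded
        //    (must be called before any parameter is read).
        request.setCharacterEncoding("UTF-8");
        String text = request.getParameter("text"); // decoded into a UTF-16 String

        // 2. Declare the response encoding so the browser can decode it;
        //    this also determines how the Writer encodes characters to bytes.
        response.setContentType("text/html; charset=UTF-8");

        // 3. Anything written here is encoded as UTF-8 on the way out.
        PrintWriter out = response.getWriter();
        out.write(text == null ? "" : text);
    }
}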
answered Sep 22 '22 by Vineet Reynolds