Understanding character encoding in a typical Java web app

Some pseudocode:

String a = "A bunch of text"; //UTF-16
saveTextInDb(a); //Write to Oracle VARCHAR(15) column
String b = readTextFromDb(); //UTF-16
out.write(b); //Write to http response

When you save the Java String (UTF-16) to an Oracle VARCHAR(15) column, does Oracle also store it as UTF-16? Does the length of an Oracle VARCHAR refer to the number of Unicode characters (and not the number of bytes)?

When we write b to the ServletResponse, is it written as UTF-16, or are we by default converting to another encoding like UTF-8?

asked Mar 28 '10 20:03

Marcus Leon

2 Answers

Instead of UTF-16, think of the "internal representation" of your string. A String in Java is a sequence of characters, and you don't need to care which encoding is used internally. Encoding becomes relevant when you interact with the world outside the program. In your example, saveTextInDb, readTextFromDb and write do that. Every time you exchange strings with the outside, an encoding is used for the conversion. saveTextInDb and readTextFromDb look like self-made methods (at least I don't know them), so you should look up which encoding these methods use. The write method of a Writer always produces bytes in the encoding associated with that writer. If you get your Writer from an HttpServletResponse, the associated encoding is the one used for outputting the response (it is sent in the headers).

response.setCharacterEncoding("UTF-8"); // must be called before getWriter()
Writer out = response.getWriter();

With this code, out is a Writer that encodes strings as UTF-8. The same applies when you write to a file:

Writer fileout = new OutputStreamWriter(new FileOutputStream(myfile), "ISO-8859-1");
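
To see that the encoding belongs to the Writer rather than to the String, here is a minimal sketch that writes the same text through two OutputStreamWriter instances. It targets an in-memory ByteArrayOutputStream instead of a response or a file, and the class name WriterEncodingDemo and the helper encode are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterEncodingDemo {

    // Hypothetical helper: writes text through a Writer bound to the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café": 4 chars in the String
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // prints 5 (é takes 2 bytes)
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // prints 4 (é takes 1 byte)
    }
}
```

The String is identical in both calls; only the Writer's charset decides which bytes leave the program.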

If you access a DB, the framework you use should ensure a consistent exchange of strings with the database.

answered Sep 23 '22 19:09

Dishayloo

The ability of Oracle to store (and later retrieve) Unicode text depends only on the character set of the database (usually specified during database creation). Choosing AL32UTF8 as the character set is recommended for storing Unicode text in CHAR datatypes (including VARCHAR/VARCHAR2), because it gives you access to all Unicode code points while, for mostly ASCII text, consuming less storage than a UTF-16 encoding such as AL16UTF16.

Assuming this is done, the Oracle JDBC driver is responsible for converting UTF-16 encoded data into AL32UTF8. This "automatic" conversion between encodings also happens when data is read from the database.

To answer the question on the length of VARCHAR: by default, the definition of a VARCHAR2 column in Oracle uses byte semantics, so VARCHAR2(n) defines a column that can store n bytes (the default is governed by the NLS_LENGTH_SEMANTICS parameter of the database). If you need to define the size in characters, use VARCHAR2(n CHAR).
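
The difference between byte and character semantics can be checked on the Java side before the data ever reaches the database. The following is only a sketch, assuming an AL32UTF8 database (so the stored bytes match Java's UTF-8 encoding); the class name ByteLengthDemo and the helper utf8Length are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {

    // Hypothetical helper: number of bytes the text occupies once encoded as UTF-8.
    // Under byte semantics in an AL32UTF8 database, this is the figure that
    // counts against the n in VARCHAR2(n).
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "A bunch of text";                         // 15 ASCII characters
        String accented = "\u00dcn\u00efc\u00f6d\u00e9 strings"; // "Ünïcödé strings", also 15 characters

        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");       // 15 chars, 15 bytes
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes"); // 15 chars, 19 bytes
    }
}
```

Both strings have 15 characters, but the second needs 19 bytes in UTF-8, so it would not fit into a column sized at 15 bytes, while a column declared with 15 CHAR could hold it.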

The encoding of the data written to the ServletResponse object depends on the container's default character encoding (ISO-8859-1 according to the Servlet specification, unless configured otherwise), unless an encoding is specified via the ServletResponse.setCharacterEncoding() or ServletResponse.setContentType() API calls. All in all, for a complete Unicode solution involving an Oracle database, one must know three things:

The encoding of the incoming data (i.e. the encoding of the data read via the ServletRequest object). You can influence this by specifying the accepted encoding in HTML forms via the accept-charset attribute. If the request does not declare its encoding, the application can set it to a known value via the ServletRequest.setCharacterEncoding() method, which must be called before any parameters are read. This method doesn't change the existing encoding of the bytes in the stream; it only tells the container how to decode them. If the input stream is in ISO-Latin1, specifying a different encoding will most likely produce garbled characters. Knowing the encoding is important, since the Java runtime libraries need to know the original encoding of the stream if its contents are to be treated as character primitives or Strings. In particular, this is required when you invoke ServletRequest.getParameter or similar methods that process the stream and return String objects. The decoding process creates characters in Java's internal representation (UTF-16).
The encoding of the data read from streams, as opposed to data created within the JVM. This is quite important, since the encoding of data read from a stream cannot be changed. There is, however, a decoding process that converts characters in supported encodings to UTF-16 characters whenever such data is accessed as a character primitive or as a String. New String objects, on the other hand, can be created from bytes with an explicitly named encoding. This matters when you write the contents of the stream out to another stream (the HttpServletResponse object's output stream, for instance). If the contents of the input stream are treated as a sequence of bytes, and not as characters or Strings, the JVM performs no decoding. It follows that when no intermediate character or String objects are created, the bytes must be written to the output stream unaltered. Otherwise, it is quite possible that the contents of the output stream will be malformed and parsed incorrectly by the corresponding decoder. In simpler words:
- if one is writing String objects or characters to the servlet's output stream, then one must specify the encoding that the browser must use to decode the response. An appropriate encoder is then used to encode the sequence of characters in the encoding declared for the response.
- if one is writing a sequence of bytes that will be interpreted as characters, then the encoding to be specified in the HTTP header must be known beforehand.
- if one is writing a sequence of bytes that will be parsed as a sequence of bytes (for images and other binary data), then the concept of encoding is immaterial.
The database character set of the Oracle instance. As indicated previously, data is stored in the Oracle database in the defined character set (for CHAR datatypes). The Oracle JDBC driver takes care of converting data between UTF-16 and AL32UTF8 (the database character set in this case) for CHAR and NCHAR datatypes. When you invoke resultSet.getString(), the JDBC driver returns a String of UTF-16 characters. The converse holds when you send data to the database. If another database character set is used, an additional level of conversion (from UTF-16 to UTF-8 to the database character set) is performed transparently by the JDBC driver.
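
The point about decoding incoming bytes can be illustrated without a servlet container. The sketch below assumes the input is "café" encoded as ISO-8859-1 and decodes it once with the right charset and once with the wrong one; the class name DecodeDemo and the helper decode are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodeDemo {

    // Hypothetical helper: decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Bytes as they might arrive from a client that sent ISO-8859-1: 63 61 66 E9
        byte[] incoming = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(incoming, StandardCharsets.ISO_8859_1);
        String wrong = decode(incoming, StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9")); // true: decoded with the matching charset
        System.out.println(wrong.equals("caf\u00e9")); // false: the lone byte E9 is not valid UTF-8

        // Re-encoding the wrongly decoded String does not give back the original bytes,
        // which is why binary data should be copied as bytes, never via a String.
        byte[] roundTrip = wrong.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(incoming, roundTrip)); // false
    }
}
```

Note that the wrong decoding does not fail loudly: the String constructor substitutes the replacement character U+FFFD for the malformed byte, so the damage only shows up later as garbled output.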

answered Sep 22 '22 19:09

Vineet Reynolds

Tags:

java

character-encoding

unicode

oracle
