Weird results with character encodings

Question

Here is the scenario:

The DB2 database is on a mainframe system (z/OS).
The web server runs on USS (the Unix part of z/OS), running Java code with Spring JDBC.
The browser we tested with and the client program run on Windows 7 (default encoding is windows-1252).

We have a string that contains a Spanish character (ú). It is stored in the database using Spring's JDBCTemplate, so essentially plain JDBC.

When queried with a JDBC client (Squirrel, written in Java), it shows up as something else (Ãº).
When queried with a sample JDBC program and printed as a plain string, it shows up as something else (Ãº).
When queried with a sample JDBC program and printed as a UTF-8 decoded string [new String(str, "UTF-8")], it shows up correctly (ú).
When the JVM is started with UTF-8 encoding using -Dfile.encoding=utf-8, the result is printed as something else (ÃƒÂº) in both of the above cases.
The browser running the front end of the application also shows it as Ãº, even though the content header of the HTML is set to UTF-8.

At this stage I am a bit confused and have these questions:

If printing the string explicitly as UTF-8 works, why doesn't it work when the JVM is started with UTF-8 encoding?
At which layer could the problem actually be happening, the database or the JVM?

What should I do to solve this at the application level rather than at the column level?

Any pointers would be helpful.

Daniel Martin · Accepted Answer

The effects you're seeing can all be explained by the assumption that data is written to the database as UTF-8 bytes, but that the database believes that those bytes are some other character set (either ISO-LATIN-1 or Windows-1252), and then when you read the data, the string you get back is those bytes interpreted as ISO-LATIN-1 or a related character set.

The character ú in UTF-8 is the two bytes 0xC3 0xBA. When those bytes are interpreted as ISO-LATIN-1 or win-1252, you get the two characters Ãº.

The two characters Ãº, when written in UTF-8, are the four bytes 0xC3 0x83 0xC2 0xBA. When those four bytes are interpreted as win-1252, you get the four characters ÃƒÂº. (Under ISO-LATIN-1, i.e. ISO-8859-1, the byte 0x83 is an invisible control character rather than ƒ, so the ƒ you see points to win-1252 at the display end.)

(Windows-1252 and ISO-LATIN-1 agree on the bytes 0xC3, 0xC2 and 0xBA, so from the Ãº evidence alone I can't tell which of the two the database is using.)
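The byte arithmetic above can be reproduced with a small sketch. The class and method names (MojibakeDemo, misread, hex) are made up for illustration, and it assumes the JDK in use ships the windows-1252 charset, as standard desktop JDKs do. The string literals use Unicode escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Assumption: windows-1252 is available in this JDK.
    static final Charset WIN_1252 = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, then decodes those bytes with the wrong charset (windows-1252). */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, WIN_1252);
    }

    /** Formats bytes as space-separated hex, e.g. "C3 BA". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00FA";      // the single character u-acute
        String once = misread(original); // two characters: A-tilde, masculine ordinal
        String twice = misread(once);    // four characters, the doubly garbled form

        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8))); // C3 BA
        System.out.println(hex(once.getBytes(StandardCharsets.UTF_8)));     // C3 83 C2 BA
        System.out.println(once.length() + " " + twice.length());           // 2 4
    }
}
```

Printing hex and lengths rather than the strings themselves keeps the output independent of whatever encoding the console happens to use.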

What's happening to you, I believe, is this:

The JDBC clients are querying your database and are getting back a string containing the two characters Ãº from the database.
When the JVM prints a result to the Windows 7 console box, if it is not started with -Dfile.encoding=utf-8, it sends the console box the bytes needed to represent the string in win-1252. If the JVM is started with that option, it sends the console box the bytes needed to represent the string in UTF-8.
Your Windows 7 console box is set to windows-1252, and displays what Java prints by interpreting the bytes Java sends it according to windows-1252.
When you call .getBytes() with no argument, you are using the JVM's default encoding to turn the string into bytes. Therefore, new String(str.getBytes(), "UTF-8") returns an identical string if the default JVM encoding is UTF-8, and only changes anything if the default encoding is something other than UTF-8.
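The last point can be checked without restarting the JVM. The sketch below passes the "default" charset in explicitly, which is an illustrative stand-in for what the no-argument getBytes() would pick up from -Dfile.encoding. The names DefaultCharsetRoundTrip and reinterpretAsUtf8 are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetRoundTrip {
    /**
     * Mimics new String(str.getBytes(), "UTF-8"), with the JVM default
     * charset supplied as a parameter instead of read from the environment.
     */
    static String reinterpretAsUtf8(String str, Charset jvmDefault) {
        return new String(str.getBytes(jvmDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromJdbc = "\u00C3\u00BA"; // the two garbled characters returned by the database

        // Default = UTF-8: encode and decode cancel out, so the string is unchanged.
        String underUtf8 = reinterpretAsUtf8(fromJdbc, StandardCharsets.UTF_8);
        System.out.println(underUtf8.equals(fromJdbc)); // true

        // Default = windows-1252: the bytes become C3 BA, which decode to u-acute.
        String under1252 = reinterpretAsUtf8(fromJdbc, Charset.forName("windows-1252"));
        System.out.println(under1252.equals("\u00FA")); // true
    }
}
```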

This explains all the evidence you presented: the Java string retrieved by JDBC contains the characters Ãº, and when a non-UTF-8 JVM prints it to the console box, it appears as Ãº. When a UTF-8 JVM prints this string to the console box, it sends the four bytes 0xC3 0x83 0xC2 0xBA, and the console interprets them as the four characters ÃƒÂº. When a Java web server sends this string back to the browser, it does so faithfully: what the browser sees is what the Java application received from JDBC.

The first thing to check is that Spring's JDBCTemplate is receiving the data correctly and writing it to the database correctly. Can you get Spring to log what it receives from the browser, and confirm both that the browser is sending UTF-8 and that Spring knows the browser is sending UTF-8? (One useful check is to log each received string together with its length. If ú arrives as one character, the request was decoded correctly as UTF-8; if it arrives as two, it was not.)
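A minimal sketch of that kind of logging, assuming Java 8 or later: it reports the length and the code points of a string, so the log stays readable whatever encoding the log viewer uses. StringInspector and describe are hypothetical names, not part of Spring or JDBC.

```java
final class StringInspector {
    /** Returns e.g. "length=2 U+00C3 U+00BA" for the garbled two-character form. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder("length=" + s.length());
        // Append every code point in hex so the value is unambiguous in any log viewer.
        s.codePoints().forEach(cp -> sb.append(String.format(" U+%04X", cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\u00FA"));       // length=1 U+00FA
        System.out.println(describe("\u00C3\u00BA")); // length=2 U+00C3 U+00BA
    }
}
```

Calling describe on each request parameter just before it is handed to JDBC shows whether the garbling happened before or after the application received the value.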

Assuming that data is getting into the database correctly, and given that you can't change anything on the database side and want a fix purely in the application, you can apply this to every string received from JDBC:

new String(str.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)

That should transform your string back to what you want, regardless of what the JVM's default encoding is.
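If that conversion is applied to every string, it may help to wrap it in a helper that leaves values alone when they cannot be UTF-8 bytes misread as Latin-1. The following is only a sketch under that assumption. MojibakeRepair and repair are hypothetical names, and the guard is a heuristic: a value that was genuinely stored as the two characters Ãº would still be converted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    /** Re-decodes a Latin-1-misread string as UTF-8, or returns it unchanged if that is not possible. */
    static String repair(String fromJdbc) {
        if (fromJdbc == null) {
            return null;
        }
        // A char above U+00FF cannot come from a byte read as ISO-8859-1.
        for (int i = 0; i < fromJdbc.length(); i++) {
            if (fromJdbc.charAt(i) > 0xFF) {
                return fromJdbc;
            }
        }
        byte[] raw = fromJdbc.getBytes(StandardCharsets.ISO_8859_1);
        // A strict decoder throws on invalid UTF-8 instead of substituting U+FFFD.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so the string was probably stored correctly already.
            return fromJdbc;
        }
    }
}
```

For example, repair("\u00C3\u00BA") yields the single character ú, while a correctly stored "\u00FA" comes back unchanged because the lone byte 0xFA is not valid UTF-8.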

For future reference, running a JVM from the Windows command line with -Dfile.encoding=utf-8 usually requires changing the code page of your console first in order to see the output correctly. (That can be done with the command chcp 65001. Just remember to use chcp 1252 to change back before running a JVM command without that option set.)
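As an alternative to the -Dfile.encoding flag, a program can choose the output encoding itself by wrapping the stream in a PrintStream with an explicit charset. This is a sketch with hypothetical names (Utf8Console, utf8Stream). The console code page still has to match, for example via chcp 65001.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {
    /** Wraps a stream so that everything printed to it is encoded as UTF-8. */
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8"); // true = autoflush
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream console = utf8Stream(System.out);
        console.println("\u00FA"); // sends the bytes C3 BA followed by the line separator
    }
}
```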

Tags: java, encoding, utf-8

