
Weird results with character encodings

Here is the scenario:

  • The DB2 database is on a mainframe system (z/OS)
  • The web server runs on USS (the Unix part of z/OS), running Java code with Spring JDBC
  • The browser we tested with and the client program run on Windows 7 (default encoding is windows-1252)

We have a string which contains a Spanish character (ú). It is stored in the database using Spring's JdbcTemplate, so essentially JDBC.

  • When queried with a JDBC client (Squirrel, written in Java), it shows up as something else (Ãº).
  • When queried with a sample JDBC program and printed the result as a string, it shows up as something else (Ãº).
  • When queried with a sample JDBC program and printed the result as a UTF-8 encoded string [new String(str.getBytes(), "UTF-8")], it shows up correctly (ú).
  • When starting the JVM with UTF-8 encoding using -Dfile.encoding=utf-8, the result is printed as something else (ÃƒÂº) in both of the above cases.
  • The browser running the front end of the application also shows it as Ãº, even though the content header of the HTML is set to UTF-8.

At this stage I am a bit confused and have these questions:

  • If printing the string in UTF-8 format specifically works, why doesn't it work when the JVM is started with UTF-8 encoding?
  • At which layer could the problem actually be happening: the database or the JVM?

What should I do to solve this at the application level rather than at the column level?

Any pointers would be helpful.

User2709 asked Dec 07 '25 23:12



1 Answer

The effects you're seeing can all be explained by assuming that the data is written to the database as UTF-8 bytes, but that the database believes those bytes are in some other character set (either ISO-LATIN-1 or Windows-1252). When you then read the data, the string you get back is those bytes interpreted as ISO-LATIN-1 or a related character set.

The character ú in UTF-8 is the two bytes 0xC3 0xBA. When those bytes are interpreted as ISO-LATIN-1 or win-1252, you get the two characters Ãº.

The two characters Ãº when written in UTF-8 are the four bytes 0xC3 0x83 0xC2 0xBA. When those four bytes are interpreted as ISO-LATIN-1 (or win-1252), you get the four characters ÃƒÂº.

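(Windows-1252 and ISO-LATIN-1 agree on the bytes 0xC3, 0xC2 and 0xBA; they differ only on 0x83, which is ƒ in Windows-1252 and an invisible control character in ISO-LATIN-1. The two bytes stored in the database, 0xC3 0xBA, decode identically in both, so from the evidence I can't tell which of the two the database is using.)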
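The byte arithmetic above can be checked with a small sketch. The class and method names (MojibakeDemo, misdecode, hex) are made up for illustration, and the string literals use Unicode escapes so that the source file's own encoding cannot interfere with the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the string as UTF-8, then decode those same bytes as windows-1252.
    // This reproduces the misinterpretation described above.
    static String misdecode(String s) {
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(s.getBytes(StandardCharsets.UTF_8), cp1252);
    }

    // Render bytes as space-separated uppercase hex, e.g. "C3 BA".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String original = "\u00FA";        // u with acute accent
        String once = misdecode(original); // the two-character form
        String twice = misdecode(once);    // the four-character form

        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8))); // C3 BA
        System.out.println(once.length());                                  // 2
        System.out.println(hex(once.getBytes(StandardCharsets.UTF_8)));     // C3 83 C2 BA
        System.out.println(twice.length());                                 // 4
    }
}
```

Each pass through misdecode doubles the damage, which is why the same stored value can show up as either two or four characters depending on how many times it has been misinterpreted on the way to the screen.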

What's happening to you, I believe, is this:

  1. The JDBC clients are querying your database and are getting back a string containing the two characters Ãº from the database.

  2. When the JVM prints a result to the Windows 7 console box, if it is not started with -Dfile.encoding=utf-8, it sends to the console box the bytes needed to represent the string in win-1252. If the JVM is started with that option, it sends to the console box the bytes needed to represent the string in UTF-8.

  3. Your Windows 7 console box is set to windows-1252, and displays what Java prints out by interpreting the bytes Java sends it according to windows-1252.

  4. When you call .getBytes() with no argument, you are using the JVM's default encoding to turn the string into bytes. Therefore, new String(str.getBytes(), "UTF-8") will result in an identical string if the default JVM encoding is UTF-8, and can only change anything if the default encoding is something other than UTF-8.
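Point 4 can be illustrated with a short sketch. Since the JVM default charset can't be switched inside a running program, the hypothetical reinterpret method below takes the "default" as an explicit parameter; it is otherwise the same operation as new String(str.getBytes(), "UTF-8").

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Same operation as new String(str.getBytes(), "UTF-8"), but with the
    // charset that getBytes() would use passed in explicitly.
    static String reinterpret(String s, Charset jvmDefault) {
        return new String(s.getBytes(jvmDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromJdbc = "\u00C3\u00BA"; // the two-character string JDBC returns

        // Default is UTF-8: encode and decode cancel out, nothing changes.
        String a = reinterpret(fromJdbc, StandardCharsets.UTF_8);
        System.out.println(a.equals(fromJdbc)); // true

        // Default is windows-1252: the bytes become C3 BA, which is UTF-8
        // for the single character u with acute accent.
        String b = reinterpret(fromJdbc, Charset.forName("windows-1252"));
        System.out.println(b.equals("\u00FA")); // true

        // What this JVM actually uses when getBytes() has no argument.
        System.out.println(Charset.defaultCharset());
    }
}
```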

This explains all the evidence you presented: the Java string retrieved by JDBC contains the characters Ãº, and when a non-UTF-8 JVM tries to print this to the console box, it is printed as Ãº. When a UTF-8 JVM tries to print this string to the console box, it prints the four bytes 0xC3 0x83 0xC2 0xBA, and the console interprets that as the four characters ÃƒÂº. When a Java web server tries to send this string back to the browser, it does so: what the browser sees is what the Java application received out of JDBC.

The first thing to check is that the Spring JdbcTemplate is receiving the data correctly and writing to the database correctly. Can you get Spring to log what it receives from the browser somewhere, and ensure that the browser is sending UTF-8, and that Spring knows that the browser is sending UTF-8? (One thing worth logging is which strings were received and how long the string in each field was. That can tell you whether the input is being interpreted correctly as UTF-8.)
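For that logging check, something along these lines may help. This is only a sketch: StringInspector and describe are made-up names, and where you call it from (a controller, a filter, just before the JdbcTemplate call) depends on your application.

```java
public class StringInspector {
    // Returns the string's length followed by each code point in U+XXXX form,
    // so the log shows exactly which characters arrived, independent of any
    // console or log-file encoding.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder("length=" + s.length());
        s.codePoints().forEach(cp -> sb.append(String.format(" U+%04X", cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\u00FA"));       // length=1 U+00FA
        System.out.println(describe("\u00C3\u00BA")); // length=2 U+00C3 U+00BA
    }
}
```

If a field that should hold a single ú logs as length=1, the input was decoded as UTF-8 correctly; length=2 with U+00C3 U+00BA means it was already misread before it reached the database.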

Assuming the data is getting into the database correctly, and given that you say you can't make a change on the database side and want a fix purely on the application side, you can do this to every string received from JDBC:

new String(str.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)

That should transform your string back to what you want, regardless of what the JVM's default encoding is.
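If some of the strings coming back might not be damaged, a more defensive variant of that one-liner could look like the sketch below. Latin1Repair and repair are hypothetical names, and the guard logic is my own addition rather than anything JDBC or Spring provides: it only rewrites a string when it fits in ISO-8859-1 and its bytes form valid UTF-8, and otherwise returns the input unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Latin1Repair {
    // Re-reads an ISO-8859-1-decoded string as UTF-8, but only when that is
    // actually possible; otherwise the original string is returned.
    static String repair(String s) {
        // Characters outside ISO-8859-1 cannot have come from the
        // misinterpretation described above, so leave such strings alone.
        if (s == null || !StandardCharsets.ISO_8859_1.newEncoder().canEncode(s)) {
            return s;
        }
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // Strict decoding: malformed UTF-8 throws instead of being
            // silently replaced with U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return s; // not UTF-8 bytes, so presumably not damaged
        }
    }

    public static void main(String[] args) {
        System.out.println(repair("\u00C3\u00BA").equals("\u00FA")); // true: repaired
        System.out.println(repair("\u00FA").equals("\u00FA"));       // true: left alone
        System.out.println(repair("abc").equals("abc"));             // true: ASCII unchanged
    }
}
```

Note that this assumes the driver decodes as ISO-8859-1. If it really uses Windows-1252, bytes in the 0x80 to 0x9F range come back as characters outside ISO-8859-1 and this sketch will leave those strings untouched.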

For future reference, running a JVM from the Windows command line with -Dfile.encoding=utf-8 usually requires changing the code page on your console first in order to see stuff correctly. (That can be done with the command chcp 65001. Just remember to use chcp 1252 to change back before running a JVM command without that option set.)

Daniel Martin answered Dec 09 '25 15:12



