Inconsistent string length definition between Java String.substring() and Oracle 11g column VARCHAR2 size

I set up my database with a table like this:

CREATE TABLE t_audit_log
(
  description VARCHAR2 (2500)
);

In the Java app that uses it, I map a data class onto the table with Hibernate. To make sure I don't trigger SQLExceptions, I put this truncation logic in the property setter:

private static final int MAX_STRING_LEN_2500 = 2499;

public void setDescription(final String newDescription) {
    if (newDescription != null
        && newDescription.length() > MAX_STRING_LEN_2500) {
        description = newDescription.substring(0, MAX_STRING_LEN_2500);
    } else {
        description = newDescription;
    }
}

For thousands of audit log entries, this worked fine - until today. I found this in the logs:

Nov 09, 2015 7:54:40 AM org.hibernate.engine.jdbc.spi.SqlExceptionHelper logExceptions
WARN: SQL Error: 12899, SQLState: 72000
Nov 09, 2015 7:54:40 AM org.hibernate.engine.jdbc.spi.SqlExceptionHelper logExceptions
ERROR: ORA-12899: value too large for column "BLABLA"."T_AUDIT_LOG"."DESCRIPTION" 
    (actual: 2501, maximum: 2500)

Why has substring() left an extra character in the value?

Adam Avatar asked Dec 05 '22 02:12


1 Answer

I suspect your database is using "byte semantics" for length operations (BYTE is the default for NLS_LENGTH_SEMANTICS), in which case VARCHAR2 (2500) means the field may be up to 2500 bytes long when encoded, not 2500 characters. Suppose your database uses UTF-8 to encode the string: the Euro symbol (U+20AC) takes three bytes in UTF-8, so a string with 2498 ASCII characters and one Euro symbol is 2501 bytes, but only 2499 characters.
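To make the arithmetic concrete, here is a minimal sketch that builds that kind of string and compares the two measurements. The class and method names are made up for illustration, and it assumes the database encodes text as UTF-8.

```java
import java.nio.charset.StandardCharsets;

class ByteVsCharLength {
    // Hypothetical helper: 2498 ASCII characters followed by one Euro sign (U+20AC).
    static String sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2498; i++) {
            sb.append('a');
        }
        sb.append('\u20AC');
        return sb.toString();
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());    // 2499 UTF-16 code units
        System.out.println(utf8Length(s)); // 2501 bytes in UTF-8
    }
}
```

The string passes the length() check in the setter untouched, yet it is one byte too long for a column limited to 2500 bytes.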

Java's length() and substring() operate in terms of UTF-16 code units, which may or may not align with "character semantics". (A character outside the Basic Multilingual Plane takes two UTF-16 code units. It's somewhat unlikely that you'll be trying to store such characters, but it's possible.)
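As a small illustration of the code unit point, the sketch below uses U+1F600, a character outside the Basic Multilingual Plane. The class name is hypothetical; the calls are standard java.lang methods.

```java
class CodeUnitsVsCodePoints {
    // U+1F600 lies outside the Basic Multilingual Plane.
    static final String SUPPLEMENTARY = new String(Character.toChars(0x1F600));

    // What length() reports: UTF-16 code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Unicode code points, which is closer to "characters".
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SUPPLEMENTARY));  // 2
        System.out.println(codePoints(SUPPLEMENTARY)); // 1

        // substring() can cut a surrogate pair in half, leaving a lone high surrogate.
        String half = SUPPLEMENTARY.substring(0, 1);
        System.out.println(Character.isHighSurrogate(half.charAt(0))); // true
    }
}
```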

You need to decide which unit the field length should be measured in (bytes or characters); then you can work out whether to change how you're performing the truncation in Java.
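If you decide to keep the column in bytes, one possible approach is to truncate by encoded size rather than by length(). This is only a sketch under assumptions: the class and method names are invented here, and it assumes the database character set is standard UTF-8 (AL32UTF8 in Oracle), where a code point takes at most four bytes.

```java
import java.nio.charset.StandardCharsets;

class Utf8Truncator {
    // Returns the longest prefix of s that fits in maxBytes when encoded as UTF-8,
    // cutting only on code point boundaries so no surrogate pair is split.
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        if (s == null) {
            return null;
        }
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // Size of this code point in UTF-8.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n < 2498; n++) {
            sb.append('a');
        }
        sb.append('\u20AC'); // 3 bytes in UTF-8, total 2501 bytes

        String truncated = truncateToUtf8Bytes(sb.toString(), 2500);
        System.out.println(truncated.length());                                // 2498
        System.out.println(truncated.getBytes(StandardCharsets.UTF_8).length); // 2498
    }
}
```

The other route is to leave the Java side alone and have the column measured in characters instead, for example by declaring it with character semantics (VARCHAR2 (2500 CHAR)). Which one fits depends on what the limit is meant to protect.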

Jon Skeet Avatar answered Dec 06 '22 19:12