Unicode and performance

Question

I am in the process of migrating a large scale web service to be compatible with international characters. It is a Tomcat/Spring MVC/SQL Server stack. The migration itself was relatively straight forward, we made a few settings changes in Tomcat to force default use of UTF-8 in the response, changed some Java code to use the encoding and migated a few VARCHAR columns to NVARCHAR followed by a healthy dose of unit/functional tests.

Another person on my team wants a load test now to make sure none of the changes adversely affect the system performance. The individual components of this transition described above don't really hint at any performance changes and, frankly, I don't think it's completely necessary based on my limited knowledge. I plan to do it anyway, but my question, then, is this - are there any performance gotchas one might see in such a migration? Is there anything specific to a different character encoding that might change the performance of a system?

The only thing I could think of would be heavy string comparison and sorting, etc. Any ideas?

Remus Rusanu · Accepted Answer

You should consider upgrading to SQL Server 2008 R2 because it offers Unicode Compression:

Unicode compression in SQL Server 2008 R2 uses an implementation of the Standard Compression Scheme for Unicode (SCSU) algorithm to compress Unicode values that are stored in row or page compressed objects. For these compressed objects, Unicode compression is automatic for nchar(n) and nvarchar(n) columns. The SQL Server Database Engine stores Unicode data as 2 bytes, regardless of locale. This is known as UCS-2 encoding. For some locales, the implementation of SCSU compression in SQL Server 2008 R2 can save up to 50 percent in storage space.

The greatest gotcha you going to encounter is Data Type Precedence rules. Because NVARCHAR has a higher precedence than VARCHAR any expression that mixes the two will be coerced to NVARCHAR. In practical terms that means that a join condition between column A and column B that before was between two VARCHAR columns and resulted in an index seek now it will be between CAST(A as NVARCHAR) and B (consider we changed only B to NVARCHAR) and this is no longer SARGable (will cause table scan). This problem can appear in joins, in WHERE clauses, in parameter types and many other places. It needs to be carefully considered, the performance degradation that results is tremendous (full scan vs. seek).

Unicode and performance

Tags:

sql-server

spring-mvc

unicode

tomcat

dfb

1 Answers

Remus Rusanu

Recent Activity

Donate For Us

Unicode and performance

Tags:

sql-server

spring-mvc

unicode

tomcat

dfb

1 Answers

Remus Rusanu

Related questions

Recent Activity

Donate For Us