Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode and performance

I am in the process of migrating a large scale web service to be compatible with international characters. It is a Tomcat/Spring MVC/SQL Server stack. The migration itself was relatively straight forward, we made a few settings changes in Tomcat to force default use of UTF-8 in the response, changed some Java code to use the encoding and migated a few VARCHAR columns to NVARCHAR followed by a healthy dose of unit/functional tests.

Another person on my team wants a load test now to make sure none of the changes adversely affect the system performance. The individual components of this transition described above don't really hint at any performance changes and, frankly, I don't think it's completely necessary based on my limited knowledge. I plan to do it anyway, but my question, then, is this - are there any performance gotchas one might see in such a migration? Is there anything specific to a different character encoding that might change the performance of a system?

The only thing I could think of would be heavy string comparison and sorting, etc. Any ideas?

like image 930
dfb Avatar asked Jan 27 '26 12:01

dfb


1 Answers

You should consider upgrading to SQL Server 2008 R2 because it offers Unicode Compression:

Unicode compression in SQL Server 2008 R2 uses an implementation of the Standard Compression Scheme for Unicode (SCSU) algorithm to compress Unicode values that are stored in row or page compressed objects. For these compressed objects, Unicode compression is automatic for nchar(n) and nvarchar(n) columns. The SQL Server Database Engine stores Unicode data as 2 bytes, regardless of locale. This is known as UCS-2 encoding. For some locales, the implementation of SCSU compression in SQL Server 2008 R2 can save up to 50 percent in storage space.

The greatest gotcha you going to encounter is Data Type Precedence rules. Because NVARCHAR has a higher precedence than VARCHAR any expression that mixes the two will be coerced to NVARCHAR. In practical terms that means that a join condition between column A and column B that before was between two VARCHAR columns and resulted in an index seek now it will be between CAST(A as NVARCHAR) and B (consider we changed only B to NVARCHAR) and this is no longer SARGable (will cause table scan). This problem can appear in joins, in WHERE clauses, in parameter types and many other places. It needs to be carefully considered, the performance degradation that results is tremendous (full scan vs. seek).

like image 101
Remus Rusanu Avatar answered Jan 30 '26 03:01

Remus Rusanu



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!