Disadvantages of setting the whole DB charset to Unicode vs some NVARCHAR2 in Oracle DB?

Some columns of my tables need to support Unicode characters (Let's say 1% of all my columns).

I think I have the following two options:

  1. Implement the Unicode columns as NVARCHAR2; or
  2. Change the character set of the whole database to one that supports Unicode (this way I can use VARCHAR2).

I'm inclined to the second option (in order to not have to change my already existing VARCHAR2 scripts).

My question is: what are the disadvantages and advantages of this second option, when compared with the first one? Is it less performant?

John Assymptoth asked Oct 09 '12

1 Answer

I would strongly lean toward changing the character set of the database.

There are potential downsides to doing so:

  • If you are storing data that is not in the 7-bit ASCII character set in your other columns, you will increase the amount of space required to store the data. Assuming that your existing character set is one of the 8-bit character sets that let you store English plus a few other languages, any non-English characters in the data will generally require 2 or more bytes of storage per character. For example, if you are storing the character "h", that's an English character that is part of the 7-bit ASCII character set, so it would require 1 byte in either your single-byte character set or your Unicode character set. If you are storing the character "À", on the other hand, that is not part of the 7-bit ASCII character set, so it would require 2 bytes of storage in a Unicode character set vs. 1 byte in your existing single-byte character set. Other characters require 3 bytes, and supplementary characters (such as most emoji) require 4.
  • You'll have to care about character vs. byte semantics when you declare a VARCHAR2. By default, VARCHAR2(50) allocates 50 bytes of storage, which will allow you to store anywhere between 12 and 50 characters in an AL32UTF8 character set (a character can occupy up to 4 bytes), rather than being the simple 1:1 mapping you get with a single-byte character set. That will require either that you increase the size of your columns (e.g. triple or quadruple them) to ensure that they store the appropriate number of characters, or that you specify character-length semantics when you declare the column (i.e. VARCHAR2(50 CHAR)), or that you set NLS_LENGTH_SEMANTICS to CHAR before creating the objects in order to change the default to character-length semantics. There is a discussion on the Oracle Globalization forum about whether it is appropriate to change NLS_LENGTH_SEMANTICS at the instance level-- Sergiusz Wolicki, one of Oracle's top globalization gurus, strongly advises against it, though I am personally much more willing to consider it under the right circumstances. You can also set NLS_LENGTH_SEMANTICS at the session level, which is something Sergiusz does not object to, but that requires that you do so every time you run a script, which may be a concern.
  • Most tools don't deal particularly well with queries against the data dictionary when character semantics were used to create the column. They frequently fail to distinguish the CHAR_LENGTH column (length in characters) from the DATA_LENGTH column (length in bytes) and use the wrong one. This may be a minor issue for you, or a serious pain if you have existing tools/scripts/etc. that run queries against the data dictionary to generate DDL, to determine how much memory needs to be allocated, or in some other situation where you end up getting funky results.
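To make the byte-vs-character semantics point concrete, here is a minimal sketch; the table and column names are hypothetical, but VARCHAR2(n CHAR), NLS_LENGTH_SEMANTICS, and the USER_TAB_COLUMNS dictionary columns are standard Oracle features:

```sql
-- Byte semantics (the default): 50 BYTES of storage, which may hold
-- as few as 12 AL32UTF8 characters.
CREATE TABLE customers (
  name VARCHAR2(50)
);

-- Character semantics: always 50 CHARACTERS, regardless of how many
-- bytes each one occupies.
CREATE TABLE customers_chr (
  name VARCHAR2(50 CHAR)
);

-- Or change the default for the current session before running scripts:
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;

-- The data dictionary records both views of the length. CHAR_USED is
-- 'C' for character semantics and 'B' for byte semantics -- this is the
-- column many tools forget to check.
SELECT column_name, data_length, char_length, char_used
FROM   user_tab_columns
WHERE  table_name IN ('CUSTOMERS', 'CUSTOMERS_CHR');
```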

However, those downsides are more than outweighed by the advantages of having a single character set for all your data:

  • Handling NVARCHAR2 columns generally requires application code changes. Since you're going to have both VARCHAR2 and NVARCHAR2 columns, those code changes and configuration settings can be non-trivial and are often a major annoyance. Inevitably, you'll find that you mapped a particular column incorrectly in some application and you'll encounter data corruption bugs that are a pain to track down. This is more true the more layers of abstraction you have between your database and your application.
  • If 1% of the columns need to support Unicode today, it is inevitable that more columns will need to support Unicode tomorrow. As additional requirements are added, changing the data type of a column from VARCHAR2 to NVARCHAR2 is a pain-- you'll need to add a new column, copy the data over, drop the old column, rename the new column, and deal with the row migration that results. You'll then have to change all your existing applications so that they map the column correctly. When the business decides that one more column needs to support additional languages and your database and applications already support Unicode, that level of effort and testing will seem rather excessive.
  • SQL statements must be encoded in the database character set. This tends to create issues if you ever want to use data from an NVARCHAR2 column as a literal in a SQL statement either in an application (say, to avoid bind variable peeking or to better leverage a histogram) or as part of production support when you want to track down issues in the data.
  • Unicode character sets are the direction Oracle strongly encourages, and the use of NVARCHAR2 columns is discouraged. This probably doesn't have immediate practical consequences, but if your system is supposed to be around for a number of years, it's likely that there will be consequences in the future.
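The column-migration dance described above can be sketched as follows (table and column names are hypothetical; the ALTER TABLE syntax is standard Oracle):

```sql
-- A VARCHAR2 column cannot simply be altered to NVARCHAR2 once it
-- contains data, so the usual sequence is:

-- 1. Add a new NVARCHAR2 column.
ALTER TABLE orders ADD (description_new NVARCHAR2(200));

-- 2. Copy the existing data across.
UPDATE orders SET description_new = description;
COMMIT;

-- 3. Drop the old column.
ALTER TABLE orders DROP COLUMN description;

-- 4. Rename the new column to the old name.
ALTER TABLE orders RENAME COLUMN description_new TO description;

-- The updated rows may have migrated to other blocks; rebuilding the
-- table (ALTER TABLE orders MOVE) and its indexes cleans that up.
```

And that is only the database side -- every application that binds the column still has to be updated to use the national-character-set API (e.g. NVARCHAR binding) before the data stops corrupting.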

Sergiusz sums up Oracle's advice very well in this thread

Oracle's advice:

  • For any new database, create it with the AL32UTF8 character set and forget about NCHAR data types.
  • For any existing application to be made multilingual, migrate the backend database to AL32UTF8 and forget about NCHAR data types.
  • For any existing non-Unicode database serving a large legacy application system that is too costly or impossible to migrate to Unicode, to which you are asked to add a minor module that has to support multilingual data and for which a separate database makes little sense, you may consider NVARCHAR2 columns for this multilingual data.
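Following the first bullet, the character set is fixed when the database is created and is costly to change later. A minimal sketch (the database name is hypothetical, and the storage clauses a real CREATE DATABASE statement needs are omitted):

```sql
-- New databases should start life as AL32UTF8:
CREATE DATABASE mydb
  CHARACTER SET AL32UTF8
  NATIONAL CHARACTER SET AL16UTF16;
  -- datafile, logfile, and other mandatory clauses omitted for brevity

-- To check what an existing database uses:
SELECT value
FROM   nls_database_parameters
WHERE  parameter = 'NLS_CHARACTERSET';
```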
Justin Cave answered Sep 28 '22