Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?

For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database—where the character encoding is set to AL32UTF8—that contain some non-UTF-8 values. The values are always in one of these character sets:

  • US-ASCII
  • UTF-8
  • CP1252
  • Latin-1

I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it.

However, all I can find for this is the CONVERT function, which never fails, but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed.

Therefor, I have two questions:

  1. Is there some existing interface that tries to convert a string into one of list of encodings, returning the first that succeeds?
  2. And if not, is there some other interface that indicates failure if it's not able to convert a string to an encoding? If so, then I could write the previous function.

UPDATE:

For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:

CREATE OR REPLACE FUNCTION encoding_utf8(
    bytea
) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
DECLARE
    encoding TEXT;
BEGIN
    FOREACH encoding IN ARRAY ARRAY[
        'UTF8',
        'WIN1252',
        'LATIN1'
    ] LOOP
        BEGIN
            RETURN convert_from($1, encoding);
        EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
            CONTINUE;
        END;
    END LOOP;
END;
$$;

I'd dearly love to know how to do the equivalent in Oracle.

like image 624
theory Avatar asked Oct 03 '12 21:10

theory


People also ask

What is the difference between UTF-8 and AL32UTF8?

As far as these two character sets go in Oracle, the only difference between AL32UTF8 and UTF8 character sets is that AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines UTF-8).

How many characters can varchar2 hold in Oracle?

So if NLS_LENGTH_SEMANTICS = byte, you may only be able to store 5 characters in your varchar2. So varchar2(10 char) is explicit. This can store up to 10 characters.

How many characters can varchar2 data type?

The maximum length for VARCHAR2 is 32672 BYTE or 8168 CHAR which is the same as the maximum length for VARCHAR of 32672 OCTETS or 8168 CODEUNITS32.

What is the difference between nVarchar2 and varchar2?

When you specify varchar2(10), you are telling the DB that only 10 bytes of data will be stored. But, when you say nVarchar2(10), it means 10 characters will be stored. In this case, you don't have to worry about the number of bytes each character takes.


2 Answers

Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:

CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2
AS
    encoded VARCHAR2(32767);
    type  array_t IS varray(3) OF VARCHAR2(15);
    array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');
BEGIN
    FOR I IN 1..array.count LOOP
        encoded := CASE array(i)
            WHEN 'AL32UTF8' THEN string
            ELSE CONVERT(string, 'AL32UTF8', array(i))
        END;
        IF instr(
            rawtohex(
                utl_raw.cast_to_raw(
                    utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')
                )
            ),
            'EFBFBD'
        ) = 0 THEN
            RETURN encoded;
        END IF;
    END LOOP;
    RAISE VALUE_ERROR;
END;

Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.

like image 81
theory Avatar answered Sep 17 '22 21:09

theory


to check whether your database column contains invalid utf-8 use the following query:

 select CASE
            INSTR (
                  RAWTOHEX (
                      utl_raw.cast_to_raw (
                          utl_i18n.raw_to_char (
                                utl_raw.cast_to_raw ( <your_column> )
                              , 'utf8'
                          )
                      )
                  )
                , 'EFBFBD'
            )
        WHEN 0 THEN 'OK'
        ELSE 'FAIL' 
        END
   from <your_table>
      ;

given that your db charset is al32utf8.

note that EF BF BD represents an illegal encoding in utf-8.

as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.

best regards, carsten

ps: oracle names for charsets: CP1252 -> WE8MSWIN1252 LATIN-1 -> WE8ISO8859P1

like image 41
collapsar Avatar answered Sep 21 '22 21:09

collapsar