For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database—where the character encoding is set to <code>AL32UTF8</code>—that contain some non-UTF-8 values. The values are always in one of these character sets: <ul> <li>US-ASCII</li> <li>UTF-8</li> <li>CP1252</li> <li>Latin-1</li> </ul> I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it. However, all I can find for this is the <code>CONVERT</code> function, which never fails, but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed. Therefor, I have two questions: <ol> <li>Is there some existing interface that tries to convert a string into one of list of encodings, returning the first that succeeds?</li> <li>And if not, is there some other interface that indicates failure if it's not able to convert a string to an encoding? If so, then I could write the previous function.</li> </ol> <hr> UPDATE: For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need: <pre class="prettyprint"><code>CREATE OR REPLACE FUNCTION encoding_utf8( bytea ) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$ DECLARE encoding TEXT; BEGIN FOREACH encoding IN ARRAY ARRAY[ 'UTF8', 'WIN1252', 'LATIN1' ] LOOP BEGIN RETURN convert_from($1, encoding); EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN CONTINUE; END; END LOOP; END; $$; </code></pre> I'd dearly love to know how to do the equivalent in Oracle.

to check whether your database column contains invalid utf-8 use the following query: <pre class="prettyprint"><code> select CASE INSTR ( RAWTOHEX ( utl_raw.cast_to_raw ( utl_i18n.raw_to_char ( utl_raw.cast_to_raw ( <your_column> ) , 'utf8' ) ) ) , 'EFBFBD' ) WHEN 0 THEN 'OK' ELSE 'FAIL' END from <your_table> ; </code></pre> given that your db charset is al32utf8. note that <code>EF BF BD</code> represents an illegal encoding in utf-8. as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible. best regards, carsten ps: oracle names for charsets: <code>CP1252</code> -> <code>WE8MSWIN1252</code> <code>LATIN-1</code> -> <code>WE8ISO8859P1</code>

How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?

Tags:

character-encoding

unicode

utf-8

oracle

For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database—where the character encoding is set to AL32UTF8—that contain some non-UTF-8 values. The values are always in one of these character sets:

US-ASCII
UTF-8
CP1252
Latin-1

I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it.

However, all I can find for this is the CONVERT function, which never fails, but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed.

Therefor, I have two questions:

Is there some existing interface that tries to convert a string into one of list of encodings, returning the first that succeeds?
And if not, is there some other interface that indicates failure if it's not able to convert a string to an encoding? If so, then I could write the previous function.

UPDATE:

For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:

CREATE OR REPLACE FUNCTION encoding_utf8(
    bytea
) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
DECLARE
    encoding TEXT;
BEGIN
    FOREACH encoding IN ARRAY ARRAY[
        'UTF8',
        'WIN1252',
        'LATIN1'
    ] LOOP
        BEGIN
            RETURN convert_from($1, encoding);
        EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
            CONTINUE;
        END;
    END LOOP;
END;
$$;

I'd dearly love to know how to do the equivalent in Oracle.

624

asked Oct 03 '12 21:10

theory

2 Answers

Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:

CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2
AS
    encoded VARCHAR2(32767);
    type  array_t IS varray(3) OF VARCHAR2(15);
    array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');
BEGIN
    FOR I IN 1..array.count LOOP
        encoded := CASE array(i)
            WHEN 'AL32UTF8' THEN string
            ELSE CONVERT(string, 'AL32UTF8', array(i))
        END;
        IF instr(
            rawtohex(
                utl_raw.cast_to_raw(
                    utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')
                )
            ),
            'EFBFBD'
        ) = 0 THEN
            RETURN encoded;
        END IF;
    END LOOP;
    RAISE VALUE_ERROR;
END;

Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.

answered Sep 17 '22 21:09

theory

to check whether your database column contains invalid utf-8 use the following query:

 select CASE
            INSTR (
                  RAWTOHEX (
                      utl_raw.cast_to_raw (
                          utl_i18n.raw_to_char (
                                utl_raw.cast_to_raw ( <your_column> )
                              , 'utf8'
                          )
                      )
                  )
                , 'EFBFBD'
            )
        WHEN 0 THEN 'OK'
        ELSE 'FAIL' 
        END
   from <your_table>
      ;

given that your db charset is al32utf8.

note that EF BF BD represents an illegal encoding in utf-8.

as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.

best regards, carsten

ps: oracle names for charsets: CP1252 -> WE8MSWIN1252 LATIN-1 -> WE8ISO8859P1

answered Sep 21 '22 21:09

collapsar

Related questions
                            
                                SQL Query to Count() multiple tables
                            
                                Access and Filter predicates in Oracle execution plan
                            
                                Relationship between CURSOR_SHARING, Bind Variable Peeking and Histograms
                            
                                What is the efficient way to store UUID generated in Java to Oracle DB? Will removing or not removing hyphen in the UUID be any use?
                            
                                PL/SQL: Error "PLS-00306: wrong number or types of arguments in call to" triggered for table of numbers
                            
                                SQL-Developer: cant open the program
                            
                                difference between NLS_NCHAR_CHARACTERSET and NLS_CHARACTERSET for Oracle
                            
                                How to limit the results to 1 row in Oracle SQL [duplicate]
                            
                                How can I implement conditional updating in Oracle?
                            
                                How to Install DBD::Oracle in Strawberry Perl
                            
                                SQL: Oracle - Parameters in query
                            
                                Convert byte array from Oracle RAW to System.Guid?
                            
                                Truncate date to fiscal year
                            
                                Max column width in Oracle spool to file
                            
                                Is UPDATE = DELETE(marked as) + INSERT?
                            
                                Oracle sql developer color coding
                            
                                Oracle 10g small Blob or Clob not being stored inline?
                            
                                How to create a tablespace with multiple datafiles?
                            
                                Oracle inequality operator: ¬=
                            
                                when should I go for procedure or function in PL/SQL?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?

Tags:

character-encoding

unicode

utf-8

oracle

theory

People also ask

2 Answers

theory

collapsar

Recent Activity

Donate For Us