For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database—where the character encoding is set to AL32UTF8
—that contain some non-UTF-8 values. The values are always in one of these character sets:
I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it.
However, all I can find for this is the CONVERT
function, which never fails, but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed.
Therefor, I have two questions:
UPDATE:
For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:
CREATE OR REPLACE FUNCTION encoding_utf8(
bytea
) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
DECLARE
encoding TEXT;
BEGIN
FOREACH encoding IN ARRAY ARRAY[
'UTF8',
'WIN1252',
'LATIN1'
] LOOP
BEGIN
RETURN convert_from($1, encoding);
EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
CONTINUE;
END;
END LOOP;
END;
$$;
I'd dearly love to know how to do the equivalent in Oracle.
As far as these two character sets go in Oracle, the only difference between AL32UTF8 and UTF8 character sets is that AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines UTF-8).
So if NLS_LENGTH_SEMANTICS = byte, you may only be able to store 5 characters in your varchar2. So varchar2(10 char) is explicit. This can store up to 10 characters.
The maximum length for VARCHAR2 is 32672 BYTE or 8168 CHAR which is the same as the maximum length for VARCHAR of 32672 OCTETS or 8168 CODEUNITS32.
When you specify varchar2(10), you are telling the DB that only 10 bytes of data will be stored. But, when you say nVarchar2(10), it means 10 characters will be stored. In this case, you don't have to worry about the number of bytes each character takes.
Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:
CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2
AS
encoded VARCHAR2(32767);
type array_t IS varray(3) OF VARCHAR2(15);
array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');
BEGIN
FOR I IN 1..array.count LOOP
encoded := CASE array(i)
WHEN 'AL32UTF8' THEN string
ELSE CONVERT(string, 'AL32UTF8', array(i))
END;
IF instr(
rawtohex(
utl_raw.cast_to_raw(
utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')
)
),
'EFBFBD'
) = 0 THEN
RETURN encoded;
END IF;
END LOOP;
RAISE VALUE_ERROR;
END;
Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.
to check whether your database column contains invalid utf-8 use the following query:
select CASE
INSTR (
RAWTOHEX (
utl_raw.cast_to_raw (
utl_i18n.raw_to_char (
utl_raw.cast_to_raw ( <your_column> )
, 'utf8'
)
)
)
, 'EFBFBD'
)
WHEN 0 THEN 'OK'
ELSE 'FAIL'
END
from <your_table>
;
given that your db charset is al32utf8.
note that EF BF BD
represents an illegal encoding in utf-8.
as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.
best regards, carsten
ps:
oracle names for charsets:
CP1252
-> WE8MSWIN1252
LATIN-1
-> WE8ISO8859P1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With