Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How fix double encoding in PostgreSQL?

I have a table in PostgreSQL with words, but some words have invalid UTF-8 chars like 0xe7e36f and 0xefbfbd.

How I can identify all chars inside words that are invalid and replace they with some symbol like ??

EDIT: My database is in UTF-8, but I think there are double encoding from various other encodings. I think this because when I tried to convert to one type as LATIN1, I get an error saying that some char don't exists in that encoding, when I change to LATIN2 I get the same error, but with another character.

So, what is possible to do to solve this?

like image 998
Renato Dinhani Avatar asked Nov 18 '11 16:11

Renato Dinhani


People also ask

What encoding does PostgreSQL use?

ScaleGrid PostgreSQL deployments use UTF-8 as the default encoding on both client and server side. The template1 database is UTF-8 encoded, and uses the en_US. UTF-8 locale. By default any databases you create will also inherit this encoding.

Does PostgreSQL support UTF 16?

All supported character sets can be used by clients. However, some client-side only characters are not supported for use within the server. Unlike Oracle, PostgreSQL doesn't support an NVARHCHAR data type and doesn't offer support for UTF-16.

How do I change collate and Ctype in PostgreSQL?

You cannot to change these values for already created databases. In this moment, when there are not other databases, the most easy solution is a) stop database, b) delete data directory, c) run manually initdb with options --encoding and --locale (run this command under postgres user).


1 Answers

Usage

It's a solution for my specific case, but maybe with some modifications can help another people.

Usage

SELECT fix_wrong_encoding('LATIN1');

Function

-- Convert words with wrong encoding
CREATE OR REPLACE FUNCTION fix_wrong_encoding(encoding_name VARCHAR)
RETURNS VOID
AS $$
DECLARE     
    r RECORD;
    counter INTEGER;
    token_id INTEGER;
BEGIN
    counter = 0;
    FOR r IN SELECT t.id, t.text FROM token t
    LOOP
        BEGIN
            RAISE NOTICE 'Converting %', r.text;
            r.text := convert_from(convert_to(r.text,encoding_name),'UTF8');
            RAISE NOTICE 'Converted to %', r.text;
            RAISE NOTICE 'Checking existence.';
            SELECT id INTO token_id FROM token WHERE text = r.text;             
            IF (token_id IS NOT NULL) THEN
                BEGIN
                    RAISE NOTICE 'Token already exists. Updating ids in textblockhastoken';
                    IF(token_id = r.id) THEN
                        RAISE NOTICE 'Token is the same.';
                        CONTINUE;
                    END IF;
                    UPDATE textblockhastoken SET tokenid = token_id
                    WHERE tokenid = r.id;
                    RAISE NOTICE 'Removing current token.';
                    DELETE FROM token WHERE id = r.id;
                END;
            ELSE
                BEGIN
                    RAISE NOTICE 'Token don''t exists. Updating text in token';
                    UPDATE token SET text = r.text WHERE id = r.id;
                END;
            END IF;
            EXCEPTION WHEN untranslatable_character THEN
                --do nothing
            WHEN character_not_in_repertoire THEN
                --do nothing
            END;
            counter = counter + 1;
            RAISE NOTICE '% token converted', counter;
    END LOOP;
END
$$
LANGUAGE plpgsql;
like image 144
Renato Dinhani Avatar answered Sep 24 '22 06:09

Renato Dinhani