How to fix double encoding in PostgreSQL?

Tags: sql, postgresql, encoding, utf-8

I have a table in PostgreSQL containing words, but some of the words contain invalid UTF-8 characters such as 0xe7e36f and 0xefbfbd.

How can I identify all the invalid characters inside the words and replace them with a symbol such as "?"?

EDIT: My database is in UTF-8, but I think the data was double-encoded from various other encodings. I think so because when I tried to convert everything to a single encoding such as LATIN1, I got an error saying that some character doesn't exist in that encoding. When I switched to LATIN2, I got the same error, but for a different character.

So, what can I do to solve this?

998

asked Nov 18 '11 16:11

Renato Dinhani

1 Answer

This is a solution for my specific case, but with some modifications it may help other people.

Usage

SELECT fix_wrong_encoding('LATIN1');
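To make it clearer what the function does to each word, here is a minimal sketch of the round trip on a single value. It assumes a UTF8 database and a client encoding that displays the characters correctly. The character 'ç' is only an illustration and is not taken from my data.

```sql
-- Simulate the damage: take the UTF-8 bytes of 'ç' and misread them as LATIN1.
SELECT convert_from(convert_to('ç', 'UTF8'), 'LATIN1') AS double_encoded;

-- Undo it: turn the garbled text back into LATIN1 bytes and read them as UTF-8.
SELECT convert_from(convert_to('Ã§', 'LATIN1'), 'UTF8') AS repaired;
```

Under those assumptions the first query should return 'Ã§' and the second should return 'ç'. The second expression is the same one the function applies to every row.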

Function

-- Convert words with wrong encoding
CREATE OR REPLACE FUNCTION fix_wrong_encoding(encoding_name VARCHAR)
RETURNS VOID
AS $$
DECLARE     
    r RECORD;
    counter INTEGER;
    token_id INTEGER;
BEGIN
    counter = 0;
    FOR r IN SELECT t.id, t.text FROM token t
    LOOP
        BEGIN
            RAISE NOTICE 'Converting %', r.text;
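            -- Re-encode the text into encoding_name, then read those bytes back as UTF-8;
            -- this undoes one round of double encoding.
            r.text := convert_from(convert_to(r.text, encoding_name), 'UTF8');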
            RAISE NOTICE 'Converted to %', r.text;
            RAISE NOTICE 'Checking existence.';
            SELECT id INTO token_id FROM token WHERE text = r.text;             
            IF (token_id IS NOT NULL) THEN
                BEGIN
                    RAISE NOTICE 'Token already exists. Updating ids in textblockhastoken';
                    IF(token_id = r.id) THEN
                        RAISE NOTICE 'Token is the same.';
                        CONTINUE;
                    END IF;
                    UPDATE textblockhastoken SET tokenid = token_id
                    WHERE tokenid = r.id;
                    RAISE NOTICE 'Removing current token.';
                    DELETE FROM token WHERE id = r.id;
                END;
            ELSE
                BEGIN
                    RAISE NOTICE 'Token doesn''t exist. Updating text in token';
                    UPDATE token SET text = r.text WHERE id = r.id;
                END;
            END IF;
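        EXCEPTION
            WHEN untranslatable_character THEN
                NULL; -- text cannot be represented in encoding_name; leave the row as is
            WHEN character_not_in_repertoire THEN
                NULL; -- resulting bytes are not valid UTF-8; leave the row as is
        END;
        counter = counter + 1;
        RAISE NOTICE '% tokens processed', counter;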
    END LOOP;
END
$$
LANGUAGE plpgsql;
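Because the function above updates and deletes rows, it may be worth previewing the changes first. The following is only a sketch: try_fix_encoding is a hypothetical helper that is not part of my original solution, and it assumes the same token table with id and text columns. It returns the repaired text, or NULL when the conversion fails.

```sql
-- Hypothetical helper: attempt the same round trip as fix_wrong_encoding,
-- but return NULL instead of raising when the text cannot be repaired.
CREATE OR REPLACE FUNCTION try_fix_encoding(word VARCHAR, encoding_name VARCHAR)
RETURNS TEXT
AS $$
BEGIN
    RETURN convert_from(convert_to(word, encoding_name), 'UTF8');
EXCEPTION
    WHEN untranslatable_character OR character_not_in_repertoire THEN
        RETURN NULL;
END
$$
LANGUAGE plpgsql;

-- Dry run: list only the tokens whose text would actually change.
SELECT id, original, repaired
FROM (
    SELECT t.id, t.text AS original, try_fix_encoding(t.text, 'LATIN1') AS repaired
    FROM token t
) AS preview
WHERE repaired <> original;
```

Rows where the helper returns NULL are filtered out by the comparison, so the result only lists words that the LATIN1 round trip would rewrite. Running it with other encoding names (for example 'LATIN2') shows which rows each encoding would affect before anything is modified.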

144

answered Sep 24 '22 06:09

Renato Dinhani
