Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

S3 -> Redshift cannot handle UTF8

We have a file in S3 that is loaded in to Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä which is being translated into .. during the copy command and is now too long for the 20 characters.

I have verified that the data is correct in S3, but the COPY command does not understand the UTF-8 characters during import. Has anyone found a solution for this?

like image 202
Elliot Chance Avatar asked Dec 22 '14 23:12

Elliot Chance


People also ask

What is the max size for a varchar in redshift?

You can create an Amazon Redshift table with a TEXT column, but it is converted to a VARCHAR(256) column that accepts variable-length values with a maximum of 256 characters.

Which format is not supported by copy command?

Save this question.

What is coalesce in redshift?

Redshift coalesce function is used to return the first not null value from table column or argument list which was we have used in query. This function will evaluates values or arguments from left to right to find the not-null value or argument.

How do you escape a character in redshift?

A character expression that will escape metacharacters characters in the pattern. The default is two backslashes ('\\').


1 Answers

tl;dr

the byte length for your varchar column just needs to be larger.

Detail

Multi-byte characters (UTF-8) are supported in the varchar data type, however the length that is provided is in bytes, NOT characters.

AWS documentation for Multibyte Character Load Errors states the following:

VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Therefore if you want the character Ä to be allowed, then you need to allow 2 bytes for this character, instead of 1 byte.

AWS documentation for VARCHAR or CHARACTER VARYING states the following:

... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.

For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8

Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.

like image 76
Adrian Torrie Avatar answered Sep 29 '22 09:09

Adrian Torrie