String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence:

Question

Team, I am using redshift version *(8.0.2 ). while loading data using COPY command, I get an error: - "String contains invalid or unsupported UTF8 codepoints, Bad UTF8 hex sequence: bf (error 3)". It seems COPY trying to load UTF-8 "bf" into VARCHAR field. As per Amazon redshift, this error code 3 defines below:

error code3: 
The UTF-8 single-byte character is out of range. The starting byte must not be 254, 255
or any character between 128 and 191 (inclusive).

Amazon recommnds this as solution - we need to go replace the character with a valid UTF-8 code sequence or remove the character.

could you please help me how to replace the character with valid UTF-8 code ?

when i checked database properties in PG-ADMIN, it shows the encoding as UTF-8.

Please guide me how to replace the character in the input delimited file.

Thanks...

Pat · Accepted Answer

I've run into this issue in RedShift while loading TPC-DS datasets for experiments.

Here is the documentation and forum chatter I found via AWS:https://forums.aws.amazon.com/ann.jspa?annID=2090

And here is the explicit commands you can use to solve data conversion errors:http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-acceptinvchars

You can explicitly replace the invalid UTF-8 characters or disregard them all together during the COPY phase by stating ACCEPTINVCHARS.

Try this:

copy table from 's3://my-bucket/my-path
credentials 'aws_iam_role=<your role arn>'
ACCEPTINVCHARS 
delimiter '|' region 'us-region-1';

Warnings:
Load into table 'table' completed, 500000 record(s) loaded successfully.
Load into table 'table' completed, 4510 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.

0 rows affected
COPY executed successfully

Execution time: 33.51s

packmul3 · Answer

For many people loading CSVs into databases, they get their files from someone using Excel or they have access to Excel. If so, this problem is quickly solved by:

First saving the file out of Excel using the Save As and selecting CSV UTF-8 (Comma Delimited) (*.csv) format, by requesting/training those giving you the files to use this export format. Note many people by default export to csv using the CSV (Comma delimited) (*.csv) format and there is a difference.
Loading the csv into Excel and then immediately Saving As to the UTF-8 csv format.

Of course it wouldn't work for files unusable by Excel, ie. larger than 1 million rows, etc. Then I would use the iconv suggestion by mike_pdb

mike_pdb · Answer

Sounds like the encoding of your file might not be utf-8. You might try this technique that we use sometimes

cat myfile.tsv| iconv -c -f ISO-8859-1 -t utf8 > myfile_utf8.tsv

String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence:

Tags:

amazon-redshift

user3258784

3 Answers

Pat

packmul3

mike_pdb

Recent Activity

Donate For Us

String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence:

Tags:

amazon-redshift

user3258784

3 Answers

Pat

packmul3

mike_pdb

Related questions

Recent Activity

Donate For Us