I am trying to import a CSV into AWS Redshift (PostgreSQL 8.x).
The data flow is: MySQL -> parquet files on S3 -> CSV files on S3 -> Redshift.
The MySQL table SQL:
create table orderitems
(
    id char(36) collate utf8_bin not null primary key,
    store_id char(36) collate utf8_bin not null,
    ref_type int not null,
    ref_id char(36) collate utf8_bin not null,
    store_product_id char(36) collate utf8_bin not null,
    product_id char(36) collate utf8_bin not null,
    product_name varchar(50) null,
    main_image varchar(200) null,
    price int not null,
    count int not null,
    logistics_type int not null,
    time_create bigint not null,
    time_update bigint not null,
    ...
);
I used the same SQL to create the table in Redshift, but the COPY got an error while importing the CSV:
# The parquet files were dumped by Sqoop.
import pandas as pd
import psycopg2
import smart_open

p2 = 'xxx'  # s3 path for the output csv file
df = pd.read_parquet(path)  # path points at the parquet files on s3
with smart_open.smart_open(p2, 'w') as f:
    df.to_csv(f, index=False)  # python3 default encoding is utf-8

conn = psycopg2.connect(CONN_STRING)
sql = """COPY %s FROM '%s' credentials 'aws_iam_role=%s' region 'cn-north-1'
delimiter ',' FORMAT AS CSV IGNOREHEADER 1 ; commit ;""" % (to_table, p2, AWS_IAM_ROLE)
print(sql)
cur = conn.cursor()
cur.execute(sql)
conn.close()
By checking STL_LOAD_ERRORS I found the error is on the product_name column. The value that fails is 伊利畅轻蔓越莓奇亚籽风味发酵乳215g/瓶 (a Chinese product name).
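For reference, a sketch of how the error rows can be pulled from STL_LOAD_ERRORS with psycopg2, reusing the CONN_STRING placeholder from the script above:

# Sketch: show the most recent COPY errors (column, declared length, failing value, reason).
import psycopg2

conn = psycopg2.connect(CONN_STRING)
cur = conn.cursor()
cur.execute("""
    SELECT colname, type, col_length, raw_field_value, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 5;
""")
for row in cur.fetchall():
    print(row)
conn.close()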
So it looks like some encoding problem, but since MySQL is UTF-8 and the CSV is UTF-8 too, I don't know what is wrong.
Your column is a varchar data type with length 50. In Redshift that is 50 bytes, not 50 characters. The example value you've given has 16 Chinese characters, which are 3 bytes each in UTF-8, plus five ASCII characters at one byte each, so 53 bytes in total. That is longer than the byte length of the column, so the import fails.
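The byte count is easy to verify in Python (a quick check, not part of the original script); widening the column, for example to varchar(200) as the table already uses for main_image, is one possible fix, since Redshift measures varchar lengths in bytes:

# Redshift varchar(n) limits the UTF-8 byte length to n, not the character count.
value = '伊利畅轻蔓越莓奇亚籽风味发酵乳215g/瓶'
print(len(value))                  # 21 characters
print(len(value.encode('utf-8')))  # 53 bytes -> does not fit in varchar(50)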