I have to extract data from several different database engines. After this data is exported, I send the data to AWS S3 and copy that data to Redshift using a COPY command. Some of the tables contain lots of text, with line breaks and other characters present in the column fields. When I run the following code:
import csv

# cursor is an open DB-API cursor (psycopg2 in this case)
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL,
                   quotechar='"', doublequote=True, lineterminator='\n')
    a.writerows(rows)
Columns that contain carriage returns/line breaks end up splitting records across multiple lines in the output:
"2017-01-05 17:06:32.802700"|"SampleJob"|""|"Date"|"error"|"Job.py"|"syntax error at or near ""from"" LINE 34: select *, SYSDATE, from staging_tops.tkabsences;
^
-<class 'psycopg2.ProgrammingError'>"
which causes the import process to fail. I can work around this by hard-coding for exceptions:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL,
                   quotechar='"', doublequote=True, lineterminator='\n')
    for row in rows:
        escaped_row = []
        for c in row:
            if isinstance(c, str):
                # Escape backslashes first, so the backslashes added
                # by the escapes below aren't doubled again.
                c = c.replace("\\", "\\\\")
                c = c.replace("\n", "\\n")
                c = c.replace("|", "\\|")
            escaped_row.append(c)
        a.writerow(escaped_row)
But this takes a long time to process larger files, and seems like bad practice in general. Is there a faster way to export data from a SQL cursor to CSV that will not break when faced with text columns that contain carriage returns/line breaks?
Add a "text qualifier" in the export definition. The usual is double-quotes. Or use tabs instead of commas as the delimiter, if you can. .csv files work with either, in most cases. Can work with pipes (also called "vertical bar"; key above Enter on your keyboard, shift-backslash), too.
Go to "Object Explorer", find the server database you want to export in CSV. Right-click on it and choose "Tasks" > "Export Data" to export table data in CSV. Then, the SQL Server Import and Export Wizard welcome window pop up.
If you're doing SELECT * FROM table without a WHERE clause, you could use COPY table TO STDOUT instead, with the right options:
copy_command = """COPY some_schema.some_message_log TO STDOUT
CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""
with open('data.csv', 'w', newline='') as fp:
cursor.copy_expert(copy_command)
In my testing, this results in a literal '\n' in the output instead of an actual newline, whereas writing through the csv writer gives broken lines.
If you do need a WHERE clause in production, you could create a temporary table and copy that instead:
cursor.execute("""CREATE TEMPORARY TABLE copy_me AS
SELECT this, that, the_other FROM table_name WHERE conditions""")
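Then copy the temporary table out the same way, e.g. (a sketch reusing the COPY options from above):

copy_command = """COPY copy_me TO STDOUT
                  CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""

with open('data.csv', 'w', newline='') as fp:
    cursor.copy_expert(copy_command, fp)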
(edit) Looking at your question again I see you mention "several different database engines". The above works with psycopg2 and PostgreSQL but could probably be adapted for other databases or libraries.
I suspect the issue is as simple as making sure the Python CSV export library and Redshift's COPY import speak a common interface. In short, check your delimiters and quoting characters and make sure both the Python output and the Redshift COPY command agree.
With slightly more detail: the DB drivers will have already done the hard work of getting the data into Python in a well-understood form. That is, each row from the DB is a list (or tuple, generator, etc.), and each cell is individually accessible. And at the point you have a list-like structure, Python's CSV exporter can do the rest of the work and -- crucially -- Redshift will be able to COPY FROM the output, embedded newlines and all. In particular, you should not need to do any manual escaping; the .writerow() or .writerows() functions should be all you need.
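To see that in action, here is a small self-contained demonstration (the cell contents are invented, loosely modeled on the error text in the question):

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['id-1', 'error at line 34\n^\n<ProgrammingError>'])
print(repr(buf.getvalue()))
# 'id-1,"error at line 34\n^\n<ProgrammingError>"\r\n'

The field with embedded newlines comes out quoted, as one logical CSV record that Redshift can parse.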
Redshift's COPY implementation understands the most common dialect of CSV by default, which is to:
- delimit fields with a comma (,),
- quote fields that contain the delimiter, the quote character, or a line break with a double quote ("),
- and escape a double quote inside a quoted field by doubling it (" → "").

To back that up with documentation from Redshift, FORMAT AS CSV:
... The default quote character is a double quotation mark ( " ). When the quote character is used within a field, escape the character with an additional quote character. ...
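Python's csv defaults line up with that: with doublequote=True (the default), an embedded quote character is escaped by doubling it. A quick demonstration using a value drawn from the question's own data:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['syntax error at or near "from"'])
print(buf.getvalue())
# "syntax error at or near ""from"""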
However, your Python CSV export code uses a pipe (|) as the delimiter and sets the quotechar to a double quote ("). That, too, can work, but why stray from the defaults? Suggest using CSV's namesake and keeping your code simpler in the process:
import csv

cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    csvw = csv.writer(fp)
    csvw.writerows(rows)
From there, tell COPY to use the CSV format (again with no need for non-default specifications):
COPY your_table FROM your_csv_file auth_code FORMAT AS CSV;
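For concreteness, a sketch of running that COPY from Python over a connection to the Redshift cluster; the table name, S3 path, and IAM role ARN are placeholders, not values from the question:

# redshift_cursor is assumed to be an open DB-API cursor on Redshift.
copy_sql = """
    COPY your_table
    FROM 's3://your-bucket/data.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-copy-role'
    FORMAT AS CSV
"""
redshift_cursor.execute(copy_sql)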
That should do it.