 

Writing results from SQL query to CSV and avoiding extra line-breaks

I have to extract data from several different database engines. After this data is exported, I send the data to AWS S3 and copy that data to Redshift using a COPY command. Some of the tables contain lots of text, with line breaks and other characters present in the column fields. When I run the following code:

import csv

# `cursor` is an open DB-API cursor on the source database connection
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True, lineterminator='\n')
    a.writerows(rows)

Some of the columns that have carriage returns/linebreaks will create new lines:

"2017-01-05 17:06:32.802700"|"SampleJob"|""|"Date"|"error"|"Job.py"|"syntax error at or near ""from"" LINE 34: select *, SYSDATE, from staging_tops.tkabsences;
                                      ^
-<class 'psycopg2.ProgrammingError'>"

which causes the import process to fail. I can work around this by hard-coding for exceptions:

cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True, lineterminator='\n')
    for row in rows:
        cells = []
        for c in row:
            if isinstance(c, str):
                # escape backslashes first so the escapes added next survive
                c = c.replace("\\", "\\\\")
                c = c.replace("\n", "\\n")
                c = c.replace("|", "\\|")
            cells.append(c)
        a.writerow(cells)

But this takes a long time to process larger files, and seems like bad practice in general. Is there a faster way to export data from a SQL cursor to CSV that will not break when faced with text columns that contain carriage returns/line breaks?

asked Feb 14 '18 by user2752159


People also ask

How do you handle commas in data when exporting to a CSV file from SQL?

Add a "text qualifier" in the export definition; the usual choice is double quotes. Alternatively, use tabs instead of commas as the delimiter if you can; .csv files work with either in most cases. Pipes (also called "vertical bar", the | character) work too.

How do I save SQL data in CSV format?

Go to "Object Explorer", find the server database you want to export in CSV. Right-click on it and choose "Tasks" > "Export Data" to export table data in CSV. Then, the SQL Server Import and Export Wizard welcome window pop up.


2 Answers

If you're doing SELECT * FROM table without a WHERE clause, you could use COPY table TO STDOUT instead, with the right options:

copy_command = """COPY some_schema.some_message_log TO STDOUT
        CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""

with open('data.csv', 'w', newline='') as fp:
    cursor.copy_expert(copy_command)

In my testing, this writes a literal '\n' instead of an actual newline, whereas writing through the csv writer produced broken lines.

If you do need a WHERE clause in production you could create a temporary table and copy it instead:

cursor.execute("""CREATE TEMPORARY TABLE copy_me AS
        SELECT this, that, the_other FROM table_name WHERE conditions""")
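
and then export the temporary table exactly as before (a short sketch reusing the COPY options shown above; copy_me is the temporary table just created):

copy_command = """COPY copy_me TO STDOUT
        CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""

with open('data.csv', 'w', newline='') as fp:
    # copy_expert streams the COPY output straight into the file object
    cursor.copy_expert(copy_command, fp)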

(edit) Looking at your question again, I see you mention "several different database engines". The above works with psycopg2 and PostgreSQL but could probably be adapted for other databases or libraries.

answered Oct 19 '22 by Nathan Vērzemnieks


I suspect the issue is as simple as making sure the Python CSV export library and Redshift's COPY import speak a common interface. In short, check your delimiters and quoting characters and make sure both the Python output and the Redshift COPY command agree.

With slightly more detail: the DB drivers will have already done the hard work of getting the data into Python in a well-understood form. That is, each row from the DB is a list (or tuple, generator, etc.), and each cell is individually accessible. Once you have a list-like structure, Python's CSV exporter can do the rest of the work and, crucially, Redshift will be able to COPY FROM the output, embedded newlines and all. In particular, you should not need to do any manual escaping; the .writerow() or .writerows() functions should be all you need.

Redshift's COPY implementation understands the most common dialect of CSV by default, which is to

  • delimit cells by a comma (,),
  • quote cells with double quotes ("),
  • and to escape any embedded double quotes by doubling them ("").
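
For example, here is a minimal sketch showing how Python's default csv dialect renders cells containing the delimiter, an embedded quote, and a newline:

import csv, sys

# default dialect: comma delimiter, double-quote quoting, embedded quotes doubled
w = csv.writer(sys.stdout)
w.writerow(['plain', 'has,comma', 'has "quote"', 'has\nnewline'])

# Output (the embedded newline stays inside a quoted field):
# plain,"has,comma","has ""quote""","has
# newline"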

To back that up with the Redshift documentation for FORMAT AS CSV:

... The default quote character is a double quotation mark ( " ). When the quote character is used within a field, escape the character with an additional quote character. ...

However, your Python CSV export code uses a pipe (|) as the delimiter and sets the quotechar to a double quote ("). That can work too, but why stray from the defaults? I suggest sticking with CSV's namesake, the comma, and keeping your code simpler in the process:

cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    csvw = csv.writer(fp)
    csvw.writerows(rows)

From there, tell COPY to use the CSV format (again with no need for non-default specifications):

COPY your_table FROM your_csv_file auth_code FORMAT AS CSV;

That should do it.

answered Oct 19 '22 by hunteke