 

Writing results from SQL query to CSV and avoiding extra line-breaks

I have to extract data from several different database engines. After this data is exported, I send the data to AWS S3 and copy that data to Redshift using a COPY command. Some of the tables contain lots of text, with line breaks and other characters present in the column fields. When I run the following code:

import csv

# `cursor` is an open DB-API cursor on the source database connection
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True, lineterminator='\n')
    a.writerows(rows)

Some of the columns that have carriage returns/linebreaks will create new lines:

"2017-01-05 17:06:32.802700"|"SampleJob"|""|"Date"|"error"|"Job.py"|"syntax error at or near ""from"" LINE 34: select *, SYSDATE, from staging_tops.tkabsences;
                                      ^
-<class 'psycopg2.ProgrammingError'>"

which causes the import process to fail. I can work around this by hard-coding for exceptions:

cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"', doublequote=True, lineterminator='\n')
    for row in rows:
        cells = []
        for c in row:
            if isinstance(c, str):
                # escape backslashes first so the escapes added next survive
                c = c.replace("\\", "\\\\")
                c = c.replace("\n", "\\n")
                c = c.replace("|", "\\|")
            cells.append(c)
        a.writerow(cells)

But this takes a long time to process larger files, and seems like bad practice in general. Is there a faster way to export data from a SQL cursor to CSV that will not break when faced with text columns that contain carriage returns/line breaks?

asked Feb 14 '18 by user2752159


People also ask

How do you handle commas in data when exporting to a CSV file from SQL?

Add a "text qualifier" in the export definition; the usual choice is double quotes. Alternatively, use tabs instead of commas as the delimiter if you can; .csv files work with either in most cases. Pipes (also called "vertical bar", the | character) work too.

How do I save SQL data in CSV format?

Go to "Object Explorer", find the server database you want to export in CSV. Right-click on it and choose "Tasks" > "Export Data" to export table data in CSV. Then, the SQL Server Import and Export Wizard welcome window pop up.


2 Answers

If you're doing SELECT * FROM table without a WHERE clause, you could use COPY table TO STDOUT instead, with the right options:

copy_command = """COPY some_schema.some_message_log TO STDOUT
        CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""

with open('data.csv', 'w', newline='') as fp:
    cursor.copy_expert(copy_command)

In my testing, this writes a literal '\n' instead of an actual newline, whereas writing through the csv writer produced broken lines.

If you do need a WHERE clause in production you could create a temporary table and copy it instead:

cursor.execute("""CREATE TEMPORARY TABLE copy_me AS
        SELECT this, that, the_other FROM table_name WHERE conditions""")
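
and then export the temporary table exactly as before (a short sketch reusing the COPY options shown above; copy_me is the temporary table just created):

copy_command = """COPY copy_me TO STDOUT
        CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""

with open('data.csv', 'w', newline='') as fp:
    # copy_expert streams the COPY output straight into the file object
    cursor.copy_expert(copy_command, fp)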

(edit) Looking at your question again, I see you mention "several different database engines". The above works with psycopg2 and PostgreSQL but could probably be adapted for other databases or libraries.

answered Oct 19 '22 by Nathan Vērzemnieks


I suspect the issue is as simple as making sure the Python CSV export library and Redshift's COPY import speak a common interface. In short, check your delimiters and quoting characters and make sure both the Python output and the Redshift COPY command agree.

With slightly more detail: the DB drivers will have already done the hard work of getting the data into Python in a well-understood form. That is, each row from the DB is a list (or tuple, generator, etc.), and each cell is individually accessible. Once you have a list-like structure, Python's CSV exporter can do the rest of the work and, crucially, Redshift will be able to COPY FROM the output, embedded newlines and all. In particular, you should not need to do any manual escaping; the .writerow() or .writerows() functions should be all you need.

Redshift's COPY implementation understands the most common dialect of CSV by default, which is to

  • delimit cells by a comma (,),
  • quote cells with double quotes ("),
  • and to escape any embedded double quotes by doubling them ("").
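
For example, here is a minimal sketch showing how Python's default csv dialect renders cells containing the delimiter, an embedded quote, and a newline:

import csv, sys

# default dialect: comma delimiter, double-quote quoting, embedded quotes doubled
w = csv.writer(sys.stdout)
w.writerow(['plain', 'has,comma', 'has "quote"', 'has\nnewline'])

# Output (the embedded newline stays inside a quoted field):
# plain,"has,comma","has ""quote""","has
# newline"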

To back that up with the Redshift documentation for FORMAT AS CSV:

... The default quote character is a double quotation mark ( " ). When the quote character is used within a field, escape the character with an additional quote character. ...

However, your Python CSV export code uses a pipe (|) as the delimiter and sets the quotechar to a double quote ("). That can work too, but why stray from the defaults? I suggest sticking with CSV's namesake, the comma, and keeping your code simpler in the process:

cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()
with open('data.csv', 'w', newline='') as fp:
    csvw = csv.writer(fp)
    csvw.writerows(rows)

From there, tell COPY to use the CSV format (again with no need for non-default specifications):

COPY your_table FROM your_csv_file auth_code FORMAT AS CSV;

That should do it.

answered Oct 19 '22 by hunteke