I'm writing data from SQL Server into a CSV file using Python's csv module and then uploading the CSV file to a Postgres database using the COPY command. The issue is that Python's csv writer automatically converts NULLs into an empty string "", and that fails my job when the column is an int or float datatype and it tries to insert this "" where it should be a None or NULL value.
To make it as easy as possible to interface with modules which implement the DB API, the value None is written as the empty string.
https://docs.python.org/3.4/library/csv.html?highlight=csv#csv.writer
What is the best way to keep the null value? Is there a better way to write csvs in Python? I'm open to all suggestions.
Example:
I have lat and long values:
42.313270000 -71.116240000
42.377010000 -71.064770000
NULL NULL
When writing to csv it converts nulls to "":
with file_path.open(mode='w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    if include_headers:
        csv_writer.writerow(col[0] for col in self.cursor.description)
    for row in self.cursor:
        csv_writer.writerow(row)
The resulting CSV:
42.313270000,-71.116240000
42.377010000,-71.064770000
"",""
From the PostgreSQL COPY documentation:

NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.
https://www.postgresql.org/docs/9.2/sql-copy.html
ANSWER:
What solved the problem for me was changing the quoting to csv.QUOTE_MINIMAL.
csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
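For reference, a minimal self-contained sketch of that change (the file name locations.csv and the literal rows are just placeholders standing in for the cursor data):

import csv

# Placeholder rows standing in for the cursor data; None is a SQL NULL.
rows = [
    (42.313270000, -71.116240000),
    (42.377010000, -71.064770000),
    (None, None),
]

with open('locations.csv', 'w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerows(rows)

# locations.csv now contains:
#   42.31327,-71.11624
#   42.37701,-71.06477
#   ,
# The last line is two empty, unquoted fields, which COPY's default CSV
# NULL setting ('') loads as NULL.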
Related questions: - Postgresql COPY empty string as NULL not work
You have two options here: change the csv.writer() quoting option in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (requires PostgreSQL 9.4 or newer).
csv.writer() and quoting

On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC:

Instructs writer objects to quote all non-numeric fields.

None values are non-numeric, so result in "" being written.
Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:

csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.

csv.QUOTE_NONE
Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.
Since all you are writing is longitude and latitude values, you don't need any quoting here; there are no delimiters or quote characters present in your data.

With either option, the CSV output for None values is simply an empty string:
>>> import csv
>>> from io import StringIO
>>> def test_csv_writing(rows, quoting):
...     outfile = StringIO()
...     csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)
...     csv_writer.writerows(rows)
...     return outfile.getvalue()
...
>>> rows = [
...     [42.313270000, -71.116240000],
...     [42.377010000, -71.064770000],
...     [None, None],
... ]
>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))
42.31327,-71.11624
42.37701,-71.06477
"",""
>>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))
42.31327,-71.11624
42.37701,-71.06477
,
>>> print(test_csv_writing(rows, csv.QUOTE_NONE))
42.31327,-71.11624
42.37701,-71.06477
,
COPY FROM, NULL values and FORCE_NULL

As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs, when you use the FORCE_NULL option. From the COPY FROM documentation:

FORCE_NULL
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and "" as NULL values for those columns, e.g.:
COPY position (
    lon,
    lat
)
FROM 'filename'
WITH (
    FORMAT csv,
    NULL '',
    DELIMITER ',',
    FORCE_NULL(lon, lat)
);
at which point it doesn't matter anymore what quoting options you used on the Python side.
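For completeness, here is a hedged sketch of driving that COPY from Python with psycopg2. Since COPY ... FROM 'filename' reads a file that lives on the database server, a client-side CSV is usually streamed with copy_expert() and FROM STDIN instead; the DSN, file name and the position table are placeholders:

import psycopg2

copy_sql = """
    COPY position (lon, lat)
    FROM STDIN
    WITH (FORMAT csv, NULL '', DELIMITER ',', FORCE_NULL (lon, lat))
"""
# Add HEADER to the WITH options if the file was written with a header row.

with psycopg2.connect('dbname=gis') as connection:      # DSN is a placeholder
    with connection.cursor() as cursor:
        with open('locations.csv') as infile:
            cursor.copy_expert(copy_sql, infile)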
If you are already querying databases to collate data to go into PostgreSQL, consider inserting directly into Postgres. If the data comes from other sources, using the foreign data wrapper (fdw) module lets you cut out the middle man and pull data directly into PostgreSQL from other sources.
Numpy data can more efficiently be inserted via binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2.copy_expert() method. This neatly avoids number -> text -> number conversions.
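That linked answer isn't reproduced here, but as a rough, hedged illustration of the PGCOPY binary framing it relies on (plain Python tuples instead of a numpy structured array; the table name, column types and DSN are placeholders):

import struct
from io import BytesIO

import psycopg2

rows = [(42.313, -71.116), (42.377, -71.065), (None, None)]

buf = BytesIO()
buf.write(b'PGCOPY\n\xff\r\n\x00')              # 11-byte PGCOPY signature
buf.write(struct.pack('!ii', 0, 0))             # flags field, header extension length
for row in rows:
    buf.write(struct.pack('!h', len(row)))      # number of fields in this tuple
    for value in row:
        if value is None:
            buf.write(struct.pack('!i', -1))    # length -1 marks a NULL
        else:
            buf.write(struct.pack('!i', 8))     # float8 payload is 8 bytes
            buf.write(struct.pack('!d', value))
buf.write(struct.pack('!h', -1))                # file trailer
buf.seek(0)

with psycopg2.connect('dbname=gis') as connection:   # DSN is a placeholder
    with connection.cursor() as cursor:
        cursor.copy_expert('COPY position (lon, lat) FROM STDIN WITH BINARY', buf)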
Don't re-invent the data pipeline wheels. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, and includes the infrastructure to run data analysis steps in parallel, and you can treat distributed, structured data as Pandas dataframes.
Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.
Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow project would let you exchange data via Parquet files, or exchange data over IPC.

The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.
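As a hedged taste of that route, pyarrow can round-trip the same lat/lng rows through a Parquet file with the nulls kept as real nulls (the column names and the /tmp path are just placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# None becomes a real null in the Arrow columns; no quoting rules involved.
table = pa.Table.from_pydict({
    'lat': [42.313, 42.377, None],
    'lng': [-71.116, -71.065, None],
})

pq.write_table(table, '/tmp/locations.parquet')
restored = pq.read_table('/tmp/locations.parquet')
print(restored.column('lat').null_count)        # -> 1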
Your code

for row in self.cursor:
    csv_writer.writerow(row)

uses the writer as-is, but you don't have to do that. You can transform particular values with a generator expression and a ternary expression:

for row in self.cursor:
    csv_writer.writerow("null" if x is None else x for x in row)
You are asking for csv.QUOTE_NONNUMERIC. This will turn everything that is not a number into a string. You should consider using csv.QUOTE_MINIMAL as it might be more what you are after:

import csv

test_data = (None, 0, '', 'data')

for name, quotes in (('test1.csv', csv.QUOTE_NONNUMERIC),
                     ('test2.csv', csv.QUOTE_MINIMAL)):
    with open(name, mode='w', newline='') as outfile:
        csv_writer = csv.writer(outfile, delimiter=',', quoting=quotes)
        csv_writer.writerow(test_data)
test1.csv:
"",0,"","data"
test2.csv:
,0,,data
I'm writing data from SQL Server into a CSV file using Python's csv module and then uploading the CSV file to a Postgres database using the COPY command.
I believe your true requirement is you need to hop data rows through the filesystem, and as both the sentence above and the question title make clear, you are currently doing that with a csv file. Trouble is that csv format offers poor support for the RDBMS notion of NULL. Let me solve your problem for you by changing the question slightly. I'd like to introduce you to parquet format. Given a set of table rows in memory, it allows you to very quickly persist them to a compressed binary file, and recover them, with metadata and NULLs intact, no text quoting hassles. Here is an example, using the pyarrow 0.12.1 parquet engine:
import pandas as pd
import pyarrow


def round_trip(fspec='/tmp/locations.parquet'):
    rows = [
        dict(lat=42.313, lng=-71.116),
        dict(lat=42.377, lng=-71.065),
        dict(lat=None, lng=None),
    ]
    df = pd.DataFrame(rows)
    df.to_parquet(fspec)
    del df
    df2 = pd.read_parquet(fspec)
    print(df2)


if __name__ == '__main__':
    round_trip()
Output:

      lat     lng
0  42.313 -71.116
1  42.377 -71.065
2     NaN     NaN
Once you've recovered the rows in a dataframe, you're free to call df2.to_sql() or use some other favorite technique to put numbers and NULLs into a DB table.
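A hedged sketch of that to_sql() route (the connection string, target table name and if_exists behaviour are assumptions, not something this answer prescribes):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@localhost/gis')  # placeholder DSN

df2 = pd.read_parquet('/tmp/locations.parquet')
# NaN values in the dataframe are written out as SQL NULL by to_sql().
df2.to_sql('position', engine, if_exists='append', index=False)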
EDIT:

If you're able to run .to_sql() on the PG server, or on the same LAN, then do that. Otherwise your favorite technique will likely involve .copy_expert().
Why?

The summary is that with psycopg2, "bulk INSERT is slow". Middle layers like sqlalchemy and pandas, and well-written apps that care about insert performance, will use .executemany(). The idea is to send lots of rows all at once, without waiting for individual result status, because we're not worried about unique index violations. So TCP gets a giant buffer of SQL text and sends it all at once, saturating the end-to-end channel's bandwidth, much as copy_expert sends a big buffer to TCP to achieve high bandwidth.
In contrast, the psycopg2 driver lacks support for high-performance executemany. As of 2.7.4 it just executes items one at a time, sending a SQL command across the WAN and waiting a round-trip time for the result before sending the next command. Ping your server; if ping times suggest you could get a dozen round trips per second, then plan on only inserting about a dozen rows per second. Most of the time is spent waiting for a reply packet rather than processing DB rows. It would be lovely if at some future date psycopg2 would offer better support for this.
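One widely used workaround, offered here as an assumption about your setup rather than part of the original answer, is psycopg2.extras.execute_values(), which folds many rows into a single multi-row INSERT and so avoids most of the per-row round trips:

import psycopg2
from psycopg2.extras import execute_values

rows = [(42.313, -71.116), (42.377, -71.065), (None, None)]   # None -> NULL

with psycopg2.connect('dbname=gis') as connection:            # DSN is a placeholder
    with connection.cursor() as cursor:
        # The single %s is expanded into pages of row tuples, so one statement
        # carries many rows instead of one round trip per row.
        execute_values(
            cursor,
            'INSERT INTO position (lat, lng) VALUES %s',
            rows,
            page_size=1000,
        )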