I would like to remove the double quotes ("") around empty values when writing my CSV to S3 with Spark's df.write.save() method.
Spark Version : 2.4.0
Python Version : 3.6.5
Here is my Python code for loading the CSV file:
df = spark.read.load(
    path('in'),
    format='csv',
    delimiter='|',
    encoding='utf-8',
    header='true'
)
The loaded CSV file:
|id|first_name|last_name|zip_code|
|1 | |Elsner |57315 |
|2 |Noelle | | |
|3 |James |Moser |48256 |
Here is my Python code for writing the CSV file:
df.write.save(
    path('out'),
    format='csv',
    delimiter='|',
    header='true'
)
The CSV file written:
|id|first_name|last_name|zip_code|
|1 |"" |Elsner |57315 |
|2 |Noelle |"" |"" |
|3 |James |Moser |48256 |
How can I remove the double quotes when writing?
Thank you very much in advance.
According to the Spark documentation, the default value of both nullValue and emptyValue is None, which results in an empty string being written. To write actual nothing, as you desire, you can set both options to the Unicode NULL character:
df.write.save(
    path('out'),
    format='csv',
    delimiter='|',
    header=True,
    nullValue='\u0000',
    emptyValue='\u0000'
)
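The quoting behavior itself is easy to reproduce outside Spark. Here is a minimal plain-Python sketch (not Spark; the rows and the write_csv helper are illustrative only) of why the "" appears: a CSV writer has to distinguish an empty string from nothing at all, and by default Spark's writer quotes empty strings:

```python
# Illustrative sketch in plain Python (not Spark): a pipe-delimited writer
# that marks empty cells with a configurable token, mimicking how Spark's
# CSV writer emits "" for empty strings by default.
rows = [
    ["1", "", "Elsner", "57315"],
    ["2", "Noelle", "", ""],
    ["3", "James", "Moser", "48256"],
]

def write_csv(rows, empty_marker):
    """Join cells with '|', substituting empty_marker for empty cells."""
    return "\n".join(
        "|".join(empty_marker if cell == "" else cell for cell in row)
        for row in rows
    )

print(write_csv(rows, '""'))  # default-like: 1|""|Elsner|57315 ...
print(write_csv(rows, ""))    # desired:      1||Elsner|57315 ...
```

Setting nullValue/emptyValue plays the role of empty_marker here: it tells the writer what token to emit for missing or empty cells.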
None of the solutions above worked for me, but they were helpful in pointing me in the right direction. What ended up working was adding the option emptyValue=u'\u0000' (when I tried without the preceding u, it ended up writing \u0000 literally to the CSV file).
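As a side note on the u prefix (Python 3 string semantics, not Spark-specific):

```python
# In Python 3, string literals are already Unicode, so the u prefix is
# redundant: u'\u0000' and '\u0000' are the same one-character NULL string.
# On Python 2, a plain byte string does not interpret \u escapes, so
# '\u0000' there is the six literal characters \u0000 -- which is why the
# option value was written out literally without the u prefix.
null_char = "\u0000"
print(len(null_char))  # a single character, code point 0
```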
If you're looking for the PySpark way of doing this, don't try the null-string trick! It's much more straightforward (once you know the trick...):
myDF.coalesce(1).write\
.option("emptyValue", None)\
.option("nullValue", None)\
.csv(outFile)
Hope it helps! I couldn't find it documented anywhere.