I would like to remove the double quotes ("") around empty values when writing my CSV to S3 with Spark's df.write.save() method.
Spark Version : 2.4.0
Python Version : 3.6.5
Here is my Python code for loading the CSV file:
df = spark.read.load(
    path('in'),
    format='csv',
    delimiter='|',
    encoding='utf-8',
    header='true'
)
The loaded CSV file:
|id|first_name|last_name|zip_code|
|1 | |Elsner |57315 |
|2 |Noelle | | |
|3 |James |Moser |48256 |
Here is my Python code for writing the CSV file:
df.write.save(
    path('out'),
    format='csv',
    delimiter='|',
    header='true'
)
The CSV file written:
|id|first_name|last_name|zip_code|
|1 |"" |Elsner |57315 |
|2 |Noelle |"" |"" |
|3 |James |Moser |48256 |
How can I remove the double quotes when writing?
Thank you very much in advance.
According to the Spark documentation, the default value of both nullValue and emptyValue is None, which results in an empty string being written. To write actual nothing, as you desire, you can set both options to the Unicode NULL character:
df.write.save(
    path('out'),
    format='csv',
    delimiter='|',
    header=True,
    nullValue='\u0000',
    emptyValue='\u0000'
)
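The quoting behavior itself is easy to reproduce outside Spark. Here is a minimal plain-Python sketch (not Spark; the rows and the write_csv helper are illustrative only) of why the "" appears: a CSV writer has to distinguish an empty string from nothing at all, and by default Spark's writer quotes empty strings:

```python
# Illustrative sketch in plain Python (not Spark): a pipe-delimited writer
# that marks empty cells with a configurable token, mimicking how Spark's
# CSV writer emits "" for empty strings by default.
rows = [
    ["1", "", "Elsner", "57315"],
    ["2", "Noelle", "", ""],
    ["3", "James", "Moser", "48256"],
]

def write_csv(rows, empty_marker):
    """Join cells with '|', substituting empty_marker for empty cells."""
    return "\n".join(
        "|".join(empty_marker if cell == "" else cell for cell in row)
        for row in rows
    )

print(write_csv(rows, '""'))  # default-like: 1|""|Elsner|57315 ...
print(write_csv(rows, ""))    # desired:      1||Elsner|57315 ...
```

Setting nullValue/emptyValue plays the role of empty_marker here: it tells the writer what token to emit for missing or empty cells.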
None of the solutions above worked for me, but they were helpful in pointing me in the right direction. What ended up working was adding the option emptyValue=u'\u0000' (when I tried without the preceding u, it ended up writing \u0000 literally to the CSV file).
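As a side note on the u prefix (Python 3 string semantics, not Spark-specific):

```python
# In Python 3, string literals are already Unicode, so the u prefix is
# redundant: u'\u0000' and '\u0000' are the same one-character NULL string.
# On Python 2, a plain byte string does not interpret \u escapes, so
# '\u0000' there is the six literal characters \u0000 -- which is why the
# option value was written out literally without the u prefix.
null_char = "\u0000"
print(len(null_char))  # a single character, code point 0
```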
If you're looking for the PySpark way of doing this, don't try the null-string trick! It's much more straightforward (once you know the trick...):
myDF.coalesce(1).write\
.option("emptyValue", None)\
.option("nullValue", None)\
.csv(outFile)
Hope it helps! I couldn't find it documented anywhere.