Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove the double quote when the value is empty in Spark?

I would like to remove the double-quotes "" when the value is empty when registering my CSV in S3 with the method df.write.save() of Spark

Spark Version : 2.4.0

Python Version : 3.6.5

Here is my code in Python for load the csv file:

df = spark.read.load(
    path('in'),
    format = 'csv',
    delimiter = '|',
    encoding = 'utf-8',
    header = 'true'
)

The loaded CSV file:

|id|first_name|last_name|zip_code|
|1 |          |Elsner   |57315   |
|2 |Noelle    |         |        |
|3 |James     |Moser    |48256   |

Here is my code in Python for write the csv file:

df.write.save(
    path('out'),
    format = 'csv',
    delimiter = '|',
    header = 'true'
)

The CSV file written:

|id|first_name|last_name|zip_code|
|1 |""        |Elsner   |57315   |
|2 |Noelle    |""       |""      |
|3 |James     |Moser    |48256   |

How to remove the double quote when registering?

Thank you very much by advance.

like image 612
Yoann Di Maria Avatar asked Feb 18 '19 15:02

Yoann Di Maria


3 Answers

According to Spark documentation, the default value of both nullValue and emptyValue is None, and will result in an empty string. To set it to actual nothing, as you desire, you can set it to the Unicode NULL:

df.write.save(
    path('out'),
    format = 'csv',
    delimiter = '|',
    header = True,
    nullValue = '\u0000',
    emptyValue = '\u0000'
)
like image 121
harppu Avatar answered Nov 15 '22 04:11

harppu


none of the solutions above worked for me, but they were helpful to guide me to the right direction. What ended up working was adding the option emptyValue=u'\u0000' (when I tried without the preceding u it ended up writing \u0000 literary to the csv file)

like image 34
Carlos Moro Avatar answered Nov 15 '22 02:11

Carlos Moro


If you're looking for the PySpark way of doing this, don't try to use the null string trick! It's much more straight-forward (once you know the trick...)

myDF.coalesce(1).write\
    .option("emptyValue", None)\
    .option("nullValue", None)\
    .csv(outFile)

Hope it helps! Couldn't find it documented anywhere

like image 29
mc matt g Avatar answered Nov 15 '22 04:11

mc matt g