
How to save CSV with all fields quoted?

The code below does not wrap the fields in double quotes, even though the double quote is the default quote character. I also tried setting the quote option to # and to a single quote, with no success. Setting quoteMode to ALL or NON_NUMERIC did not change the output either.

s2d.coalesce(64).write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .save(fname)

Are there any other options I can try? I am using spark-csv 2.11 on Spark 2.1.

Output it produces:

d4c354ef,2017-03-14 16:31:33,2017-03-14 16:31:46,104617772177,340618697

Output I am looking for:

"d4c354ef","2017-03-14 16:31:33","2017-03-14 16:31:46",104617772177,340618697

Arvind Kandaswamy asked Apr 26 '17 20:04




2 Answers

tl;dr Enable the quoteAll option.

scala> Seq(("hello", 5)).toDF.write.option("quoteAll", true).csv("hello5.csv")

The above gives the following output:

$ cat hello5.csv/part-00000-a0ecb4c2-76a9-4e08-9c54-6a7922376fe6-c000.csv
"hello","5"

That assumes the quote is " (see CSVOptions)

However, quoteAll quotes every field, numeric ones included, so it won't give you double quotes around only the non-numeric fields. Sorry.

You can see all the options in CSVOptions, which serves as the source of the options for the CSV reader and writer.

p.s. com.databricks.spark.csv is currently a mere alias for the csv format. You can use both interchangeably, but the shorter csv is preferred.

p.p.s. Use option("header", false) (false as a Boolean, not a String); that will make your code slightly more type-safe.
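
Applied to the writer from the question, this might look like the sketch below. It assumes Spark 2.x on the classpath and a local master; the sample row is made up from the question's output, and the output path quoted-output is a placeholder. Note that the numeric columns come out quoted as well.

```scala
import org.apache.spark.sql.SparkSession

object QuoteAllExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .appName("quote-all-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample row modeled on the output shown in the question.
    val s2d = Seq(
      ("d4c354ef", "2017-03-14 16:31:33", "2017-03-14 16:31:46", 104617772177L, 340618697L)
    ).toDF()

    s2d.coalesce(1).write
      .option("header", false)   // Boolean rather than String
      .option("quoteAll", true)  // quote every field, numeric ones included
      .csv("quoted-output")      // placeholder path; fails if it already exists

    spark.stop()
  }
}
```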

Jacek Laskowski answered Oct 02 '22 16:10


In Spark 2.1, where the old CSV library has been inlined, I do not see any option for what you want in the csv method of DataFrameWriter.

So I guess you have to map over your data "manually" to determine which of the Row components are non-numbers and quote them accordingly. You could utilize a straightforward isNumeric helper function like this:

def isNumeric(s: String) = s.nonEmpty && s.forall(Character.isDigit)

As you map over your Dataset, quote the values for which isNumeric returns false.
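
A minimal sketch of that quoting step in plain Scala, without Spark, could look like this. The names quoteField and toCsvLine are made up for illustration, and doubling embedded quotes is an assumed escaping convention, not something the answer specifies. Negative numbers and decimals would count as non-numeric under this isNumeric and so would be quoted.

```scala
object QuoteNonNumeric {
  // Helper from the answer: true only for non-empty, all-digit strings.
  def isNumeric(s: String): Boolean = s.nonEmpty && s.forall(_.isDigit)

  // Hypothetical helper: wrap a non-numeric field in double quotes,
  // doubling any embedded double quote (assumed escaping convention).
  // Numeric fields pass through unchanged.
  def quoteField(s: String): String =
    if (isNumeric(s)) s else "\"" + s.replace("\"", "\"\"") + "\""

  // Hypothetical helper: join the fields of one record into a CSV line.
  def toCsvLine(fields: Seq[String]): String =
    fields.map(quoteField).mkString(",")

  def main(args: Array[String]): Unit = {
    val record = Seq("d4c354ef", "2017-03-14 16:31:33", "2017-03-14 16:31:46",
      "104617772177", "340618697")
    println(toCsvLine(record))
    // prints: "d4c354ef","2017-03-14 16:31:33","2017-03-14 16:31:46",104617772177,340618697
  }
}
```

In Spark, one way to use this would be to turn each Row into such a line and write the resulting strings with the text writer instead of the csv writer, so the CSV writer does not quote or escape the fields a second time.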

Vidya answered Oct 02 '22 16:10