
Where is the reference for options for writing or reading per format?

I use Spark 1.6.1.

We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use

df.write().orc(<path>)

we would rather do something like

df.write().options(Map("format" -> "orc", "path" -> "/some_path"))

This is so that we have the flexibility to change the format or root path depending on the application that uses this helper library. Where can we find a reference for the options that can be passed into the DataFrameWriter? I found nothing in the docs here:

https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrameWriter.html#options(java.util.Map)

Satyam asked Jun 05 '17 08:06


1 Answer

Where can we find a reference to the options that can be passed into the DataFrameWriter?

The most definitive and authoritative answer is the source code:

  • CSVOptions
  • JDBCOptions
  • JSONOptions
  • ParquetOptions
  • TextOptions
  • OrcOptions
  • ...

You may find some descriptions in the docs, but there is no single page that lists them all (ideally one auto-generated from the sources so that it stays up to date).

The reason is that the options are deliberately kept separate from the format implementation, which gives you the per-use-case flexibility you want to offer (as you duly noted):

This is so that we have the flexibility to change the format or root path depending on the application that uses this helper library.
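
For the helper library described in the question, a minimal sketch could look like the following. It assumes Spark 1.6's DataFrameWriter, where the format is chosen with format(...) rather than through an entry in the options map, and the path is handed to save(...). The names WriterConfig, WriterHelper and writeWith are hypothetical and only serve the illustration; the contents of the options map still depend on the chosen format.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical configuration holder (not part of Spark): the application
// decides the format and root path, the helper library only applies them.
final case class WriterConfig(
  format: String,                           // e.g. "orc", "parquet", "json"
  path: String,                             // e.g. "/some_path"
  options: Map[String, String] = Map.empty  // format-specific options, if any
)

object WriterHelper {
  // Hypothetical helper: selects the format via format(...), passes any
  // format-specific options through options(...), and writes to cfg.path.
  def writeWith(df: DataFrame, cfg: WriterConfig): Unit =
    df.write
      .format(cfg.format)
      .options(cfg.options)
      .save(cfg.path)
}
```

A caller would then write something like WriterHelper.writeWith(df, WriterConfig("orc", "/some_path")). Note that in Spark 1.6 the ORC data source needs a HiveContext, as in the question.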


Your question seems similar to "How to know the file formats supported by Databricks?", where I said:

Where can I get the list of options supported for each file format?

That's not possible, as there is no API to follow (like in Spark MLlib) to define options. Unfortunately, every format does this on its own, and your best bet is to read the documentation or (more authoritative) the source code.

Jacek Laskowski answered Nov 02 '22 19:11