How to use Scala DataFrameReader option method

The Scala DataFrameReader has an option method with the following signature:

  def option(key: String, value: String): DataFrameReader
    // Adds an input option for the underlying data source.

So what is an "input option" for the underlying data source? Can someone share an example of how to use this function?

asked Mar 22 '16 by jlp

1 Answer

The list of available options varies by the file format. They are documented in the DataFrameReader API docs.

For example:

def json(paths: String*): DataFrame

Loads a JSON file (one object per line) and returns the result as a DataFrame.

This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.
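
If you do know the schema, a minimal sketch of that approach might look like the following (the file path and field names are made-up placeholders, not from the original answer):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{StructField, StructType, LongType, StringType}

  val spark = SparkSession.builder().appName("json-schema-example").getOrCreate()

  // Supplying the schema up front lets Spark skip the inference pass over the input.
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("count", LongType, nullable = true)
  ))

  val df = spark.read.schema(schema).json("/path/to/data.json")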

You can set the following JSON-specific options to deal with non-standard JSON files:

  • primitivesAsString (default false): infers all primitive values as a string type
  • prefersDecimal (default false): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.
  • allowComments (default false): ignores Java/C++ style comments in JSON records
  • allowUnquotedFieldNames (default false): allows unquoted JSON field names
  • allowSingleQuotes (default true): allows single quotes in addition to double quotes
  • allowNumericLeadingZeros (default false): allows leading zeros in numbers (e.g. 00012)
  • allowBackslashEscapingAnyCharacter (default false): allows accepting quoting of all characters using the backslash quoting mechanism
  • mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
    • PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
    • DROPMALFORMED: ignores whole corrupted records.
    • FAILFAST: throws an exception when it meets corrupted records.
  • columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
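
Putting the pieces together, here is a minimal sketch of how option is actually called (the path and the particular option values are illustrative assumptions, not from the original answer):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("json-options-example").getOrCreate()

  // option returns the DataFrameReader itself, so calls can be chained;
  // each call adds one key/value pair for the underlying data source.
  val df = spark.read
    .option("allowComments", "true")      // tolerate Java/C++ style comments
    .option("allowSingleQuotes", "true")  // accept single-quoted strings
    .option("mode", "DROPMALFORMED")      // drop corrupt records instead of failing
    .json("/path/to/data.json")

Note that both the key and the value are passed as strings, even for options that are logically booleans.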
answered Nov 04 '22 by Daniel Darabos