How to use Scala DataFrameReader option method

The Scala DataFrameReader has an option method with the following signature:

  def option(key: String, value: String): DataFrameReader
    // Adds an input option for the underlying data source.

So what is an "input option" for the underlying data source? Can someone share an example of how to use this function?

asked Mar 22 '16 by jlp

1 Answer

The list of available options varies by the file format. They are documented in the DataFrameReader API docs.

For example:

def json(paths: String*): DataFrame

Loads a JSON file (one object per line) and returns the result as a DataFrame.

This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.
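
If you do know the schema, a minimal sketch of that approach might look like the following (the file path and field names are made-up placeholders, not from the original answer):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{StructField, StructType, LongType, StringType}

  val spark = SparkSession.builder().appName("json-schema-example").getOrCreate()

  // Supplying the schema up front lets Spark skip the inference pass over the input.
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("count", LongType, nullable = true)
  ))

  val df = spark.read.schema(schema).json("/path/to/data.json")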

You can set the following JSON-specific options to deal with non-standard JSON files:

  • primitivesAsString (default false): infers all primitive values as a string type
  • prefersDecimal (default false): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.
  • allowComments (default false): ignores Java/C++ style comments in JSON records
  • allowUnquotedFieldNames (default false): allows unquoted JSON field names
  • allowSingleQuotes (default true): allows single quotes in addition to double quotes
  • allowNumericLeadingZeros (default false): allows leading zeros in numbers (e.g. 00012)
  • allowBackslashEscapingAnyCharacter (default false): allows accepting quoting of all characters using the backslash quoting mechanism
  • mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
    • PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
    • DROPMALFORMED: ignores whole corrupted records.
    • FAILFAST: throws an exception when it meets corrupted records.
  • columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
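
Putting the pieces together, here is a minimal sketch of how option is actually called (the path and the particular option values are illustrative assumptions, not from the original answer):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("json-options-example").getOrCreate()

  // option returns the DataFrameReader itself, so calls can be chained;
  // each call adds one key/value pair for the underlying data source.
  val df = spark.read
    .option("allowComments", "true")      // tolerate Java/C++ style comments
    .option("allowSingleQuotes", "true")  // accept single-quoted strings
    .option("mode", "DROPMALFORMED")      // drop corrupt records instead of failing
    .json("/path/to/data.json")

Note that both the key and the value are passed as strings, even for options that are logically booleans.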
answered Nov 04 '22 by Daniel Darabos