 

Available options in the spark.read.option()


When I read other people's Python code, like spark.read.option("mergeSchema", "true"), it seems that the coder already knows what parameters to use. But for a starter, is there a place to look up the available parameters? I looked them up in the Apache docs, and it shows the parameter as undocumented.

Thanks.

asked Sep 24 '18 by Tim.X

People also ask

What is option in Spark?

The core syntax for reading data in Apache Spark is spark.read.format(...).option(...).schema(...).load(...). format specifies the file format, such as CSV, JSON, or Parquet; the default is Parquet. option sets key-value configurations that parameterize how the data is read. schema is optional and lets you specify the schema explicitly if you do not want to infer it from the data source.
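
A minimal sketch of that core syntax in PySpark (the path and schema here are hypothetical, for illustration only):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-example").getOrCreate()

    df = (spark.read
          .format("json")                  # file format; the default is parquet
          .option("multiLine", "true")     # key-value option parameterizing the read
          .schema("name STRING, age INT")  # optional explicit schema (skips inference)
          .load("data/people.json"))       # load triggers the read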

What does Spark read load do?

By itself it does nothing special: load is just the final call on sqlContext.read that takes the path and kicks off the read, using whatever format and options you set on the reader. read allows data formats to be specified.

How to read multiple JSON files from different paths in spark?

Use spark.read.option("multiline", "true") for multiline JSON. With the spark.read.json() method you can also read multiple JSON files from different paths: just pass all the file names, with fully qualified paths, separated by commas. You can likewise read all the files in a directory by passing the directory as the path.
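
A short PySpark sketch of both variants (the file paths are hypothetical); note that in PySpark multiple paths are passed as a list:

    # Several explicit files, passed as a list of fully qualified paths
    df_multi = (spark.read
                .option("multiLine", "true")
                .json(["data/a.json", "data/b.json"]))

    # Every JSON file in a directory
    df_dir = spark.read.json("data/json_dir/")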

What file formats does spark support to read?

Note: Spark out of the box supports reading files in CSV, JSON, text, Parquet, and many more file formats into a Spark DataFrame.

How to read multiple CSV files in spark?

Using the spark.read.csv() method you can also read multiple CSV files: just pass all the file names, separated by commas, as the path. You can read all CSV files in a directory into a DataFrame by passing the directory as the path to csv(). The Spark CSV data source provides multiple options for working with CSV files.
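
For example, a minimal PySpark sketch (paths are hypothetical):

    # A list of CSV files read into one DataFrame
    df_multi = spark.read.csv(["data/jan.csv", "data/feb.csv"], header=True)

    # All CSV files in a directory
    df_dir = spark.read.csv("data/csv_dir/", header=True)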

What is set to any other character in spark?

This refers to common CSV read options: quote can be set to any character, and separator characters inside quoted values are then ignored; sep (the field separator) can likewise be set to any character; and multiline can be set to true to load files whose records span multiple lines. These options are generally used while reading files in Spark.
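
Read as CSV options, a sketch might look like this (the path and separator choice are assumptions for illustration):

    df = (spark.read
          .option("sep", ";")           # field separator, set to any character
          .option("quote", "\"")        # separators inside quoted values are ignored
          .option("multiLine", "true")  # allow records spanning multiple lines
          .csv("data/records.csv"))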


2 Answers

Annoyingly, the documentation for the option method is in the docs for the json method. The docs on that method list the options as follows (key -- value -- description); a short usage sketch follows the list:

  • primitivesAsString -- true/false (default false) -- infers all primitive values as a string type

  • prefersDecimal -- true/false (default false) -- infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.

  • allowComments -- true/false (default false) -- ignores Java/C++ style comments in JSON records

  • allowUnquotedFieldNames -- true/false (default false) -- allows unquoted JSON field names

  • allowSingleQuotes -- true/false (default true) -- allows single quotes in addition to double quotes

  • allowNumericLeadingZeros -- true/false (default false) -- allows leading zeros in numbers (e.g. 00012)

  • allowBackslashEscapingAnyCharacter -- true/false (default false) -- allows quoting of all characters using the backslash quoting mechanism

  • allowUnquotedControlChars -- true/false (default false) -- allows JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters)

  • mode -- PERMISSIVE/DROPMALFORMED/FAILFAST (default PERMISSIVE) -- sets the mode for dealing with corrupt records during parsing:

    • PERMISSIVE : when it meets a corrupted record, it puts the malformed string into a field configured by columnNameOfCorruptRecord and sets the other fields to null. To keep corrupt records, a user can define a string-type field named columnNameOfCorruptRecord in a user-defined schema; if the schema does not have that field, corrupt records are dropped during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field to the output schema.
    • DROPMALFORMED : silently drops whole corrupted records.
    • FAILFAST : throws an exception when it encounters corrupted records.
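
To tie this back to the question, here is a minimal sketch applying a few of these options (the path is hypothetical):

    df = (spark.read
          .option("mode", "PERMISSIVE")                            # keep malformed rows
          .option("columnNameOfCorruptRecord", "_corrupt_record")  # where they land
          .option("allowComments", "true")                         # tolerate // comments
          .option("primitivesAsString", "true")                    # primitives as strings
          .json("data/events.json"))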
answered Nov 30 '22 by csjacobs24


For built-in formats, all options are enumerated in the official documentation. Each format has its own set of options, so you have to refer to the documentation for the format you use.

  • For reading, open the docs for DataFrameReader and expand the docs for the individual methods. For the JSON format, say, expand the json method (only one variant contains the full list of options):

    json options

  • For writing, open the docs for DataFrameWriter. For example, for Parquet:

    parquet options

Schema merging is a slight special case: besides the mergeSchema read option used in the question, it can also be enabled globally through a session property:

 spark.conf.set("spark.sql.parquet.mergeSchema", "true") 
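
For what it's worth, both forms work for Parquet; a sketch with a hypothetical path:

    # Per-read option (as in the question)
    df = spark.read.option("mergeSchema", "true").parquet("data/table/")

    # Session-wide default
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    df = spark.read.parquet("data/table/")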
answered Nov 30 '22 by user10407081