Reading JSON with Apache Spark - `corrupt_record`


I have a JSON file, nodes.json, that looks like this:

[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} ,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} ,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} ,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}] 

I am able to read and manipulate this record with Python.

I am trying to read this file in Scala through the spark-shell.

From this tutorial, I can see that it is possible to read JSON via sqlContext.read.json:

val vfile = sqlContext.read.json("path/to/file/nodes.json") 

However, this results in a corrupt_record error:

vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string] 

Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident that it is not corrupt and is sound JSON.

asked Aug 11 '16 by LearningSlowly


2 Answers

Since Spark expects "JSON Lines" format rather than a typical JSON file, we can tell Spark to read typical (multi-line) JSON by specifying:

val df = spark.read.option("multiline", "true").json("<file>") 
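
For example, a minimal spark-shell sketch (assuming Spark 2.2 or later, where the multiline option exists, and the path from the question):

// Read the whole file as one multi-line JSON document (Spark 2.2+).
val df = spark.read
  .option("multiline", "true")
  .json("path/to/file/nodes.json")

// The schema should now be inferred instead of falling back to _corrupt_record:
df.printSchema()
// root
//  |-- index: long (nullable = true)
//  |-- point: array (nullable = true)
//  |    |-- element: double (containsNull = true)
//  |-- toid: string (nullable = true)

df.show()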
answered Oct 06 '22 by SandeepGodara


Spark cannot read a top-level JSON array into records, so you have to pass:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}  {"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}  {"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}  {"toid":"osgb4000000031043208","point":[508513,196023],"index":4} 

As it's described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple. Spark expects you to pass a file with many JSON entities (one entity per line), so it can distribute their processing (roughly speaking, per entity).

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is called JSON Lines (JSONL). Basically, it's an alternative to CSV.
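
Once the file is rewritten with one object per line, the original call from the question should work unchanged (a sketch assuming the same path; use sqlContext.read in Spark 1.x or spark.read in 2.x):

// Sketch: nodes.json rewritten as JSON Lines, one object per line.
val vfile = sqlContext.read.json("path/to/file/nodes.json")

// _corrupt_record disappears and the real columns show up.
vfile.printSchema()
vfile.select("toid", "index").show()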

answered Oct 06 '22 by dk14