Reading JSON with Apache Spark - `corrupt_record`


I have a JSON file, nodes.json, that looks like this:

[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} ,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} ,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} ,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}] 

I am able to read and manipulate this record with Python.

I am trying to read this file in Scala through the spark-shell.

From this tutorial, I can see that it is possible to read JSON via sqlContext.read.json:

val vfile = sqlContext.read.json("path/to/file/nodes.json") 

However, this results in a corrupt_record error:

vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string] 

Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident that it is not corrupt and is sound JSON.

asked Aug 11 '16 by LearningSlowly


2 Answers

Since Spark expects "JSON Lines" format rather than a typical JSON file, we can tell Spark to read typical (multi-line) JSON by specifying:

val df = spark.read.option("multiline", "true").json("<file>") 
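
For example, a minimal spark-shell sketch (assuming Spark 2.2 or later, where the multiline option exists, and the path from the question):

// Read the whole file as one multi-line JSON document (Spark 2.2+).
val df = spark.read
  .option("multiline", "true")
  .json("path/to/file/nodes.json")

// The schema should now be inferred instead of falling back to _corrupt_record:
df.printSchema()
// root
//  |-- index: long (nullable = true)
//  |-- point: array (nullable = true)
//  |    |-- element: double (containsNull = true)
//  |-- toid: string (nullable = true)

df.show()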
answered Oct 06 '22 by SandeepGodara


Spark cannot read a top-level JSON array into records, so you have to pass:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}  {"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}  {"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}  {"toid":"osgb4000000031043208","point":[508513,196023],"index":4} 

As it's described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple. Spark expects you to pass a file with many JSON entities (one entity per line), so it can distribute their processing (roughly speaking, per entity).

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is called JSON Lines (JSONL). Basically, it's an alternative to CSV.
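
Once the file is rewritten with one object per line, the original call from the question should work unchanged (a sketch assuming the same path; use sqlContext.read in Spark 1.x or spark.read in 2.x):

// Sketch: nodes.json rewritten as JSON Lines, one object per line.
val vfile = sqlContext.read.json("path/to/file/nodes.json")

// _corrupt_record disappears and the real columns show up.
vfile.printSchema()
vfile.select("toid", "index").show()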

answered Oct 06 '22 by dk14