 

Read multiline JSON in Apache Spark

I was trying to use a JSON file as a small DB. After registering a temp table from the DataFrame, I queried it with SQL and got an exception. Here is my code:

val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")

val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()

df.printSchema() result:

root
 |-- _corrupt_record: string (nullable = true)

My JSON file:

{   "id": 1,   "name": "Morty",   "age": 21 } 

Exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record]; 

How can I fix it?

UPD

_corrupt_record is

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|            "id": 1,|
|    "name": "Morty",|
|           "age": 21|
|                   }|
+--------------------+

UPD2

It's weird, but when I rewrite my JSON as a one-liner, everything works fine.

{"id": 1, "name": "Morty", "age": 21} 

So the problem is the newlines.

UPD3

I found the following sentence in the docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
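
For illustration, a file that this line-based reader accepts keeps one complete object per line, like this (the second record is invented for illustration):

{"id": 1, "name": "Morty", "age": 21}
{"id": 2, "name": "Rick", "age": 60}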

It isn't convenient to keep JSON in such a format. Is there any workaround to get rid of the multi-line structure of the JSON, or to convert it to a one-liner?

asked Jul 23 '16 by Finkelson



1 Answer

Spark >= 2.2

Spark 2.2 introduced the multiLine option (originally named wholeFile), which can be used to load JSON (not JSONL) files:

spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

See:

  • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
  • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.
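
For completeness, a minimal end-to-end sketch against the question's file (assuming Spark >= 2.2 and a SparkSession, which replaced SQLContext; the app name and master here are illustrative, not part of the original answer):

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming Spark >= 2.2.
val spark = SparkSession.builder()
  .appName("multiline-json")
  .master("local[*]")
  .getOrCreate()

// multiLine tells the JSON source to treat each file as one JSON document
// instead of expecting one object per line (JSON Lines).
val df = spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- id: long (nullable = true)
//  |-- name: string (nullable = true)

df.createOrReplaceTempView("user_tt")
spark.sql("SELECT name FROM user_tt").show()
// +-----+
// | name|
// +-----+
// |Morty|
// +-----+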

Spark < 2.2

Well, using JSONL-formatted data may be inconvenient, but I will argue that it is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.

It provides no schema, and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top-level documents. Arguably this is the worst imaginable format to use in systems like Apache Spark. It is also quite tricky, and typically impractical, to write valid JSON in distributed systems.

That being said, if individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:

spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)
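
If you would rather convert the data to one object per line up front, so the regular line-based reader works afterwards, a rough sketch (the directory paths are hypothetical, and this assumes one valid JSON document per file; stripping raw newlines is safe because valid JSON strings cannot contain unescaped newlines):

// Read each file whole, collapse the document onto a single line,
// and save the result as JSON Lines for the standard reader.
val jsonl = sc.wholeTextFiles("/path/to/json/dir")   // hypothetical input dir
  .values
  .map(_.replace("\n", " ").trim)

jsonl.saveAsTextFile("/path/to/jsonl/out")           // hypothetical output dir

val df = sqlCtx.read.json("/path/to/jsonl/out")

Keep in mind that wholeTextFiles loads each file as a single record, so a very large individual document cannot be split across executors.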
answered Sep 21 '22 by zero323