read.json only reading the first object in Spark

Question

I have a multi-line JSON file and I am reading it with Spark's read.json. The problem is that it only reads the first object in the file.

val dataFrame = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path)
dataFrame.rdd.saveAsTextFile("DataFrame")

Sample json:

{
    "_id" : "589895e123c572923e69f5e7",
    "thing" : "54eb45beb5f1e061454c5bf4",
    "timeline" : [ 
        {
            "reason" : "TRIP_START",
            "timestamp" : "2017-02-06T17:20:18.007+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                11.1174091, 
                69.1174091
            ],
            "endLocation" : [],
            "startLocation" : []
        },
        {
            "reason" : "TRIP_END",
            "timestamp" : "2017-02-06T17:25:26.026+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                11.5691428, 
                48.1122443
            ],
            "endLocation" : [],
            "startLocation" : []
        }
    ],
    "__v" : 0
}
{
    "_id" : "589895e123c572923e69f5e8",
    "thing" : "54eb45beb5f1e032241c5bf4",
    "timeline" : [ 
        {
            "reason" : "TRIP_START",
            "timestamp" : "2017-02-06T17:20:18.007+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                11.1174091, 
                50.1174091
            ],
            "endLocation" : [],
            "startLocation" : []
        },
        {
            "reason" : "TRIP_END",
            "timestamp" : "2017-02-06T17:25:26.026+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                51.1174091, 
                69.1174091
            ],
            "endLocation" : [],
            "startLocation" : []
        }
    ],
    "__v" : 0
}

I get only the first entry, the one with _id = 589895e123c572923e69f5e7.

Is there something that I am doing wrong?

OneCricketeer · Accepted Answer

Are you sure a file containing multiple multi-line JSON objects is supported?

Each line must contain a separate, self-contained valid JSON object... For a regular multi-line JSON file, set the multiLine option to true

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

Here a "regular JSON file" means the entire file is a single JSON object or array. Simply putting {} around your data won't work, because every value inside an object needs a key, so you would need a top-level key, say "objects". Similarly, you can try an array by wrapping the data in []. Either way, this only works if the objects inside are separated by commas.

tl;dr - the whole file needs to be one valid JSON object (or array) when multiLine is true

You're only getting one object because Spark parses the first set of braces and stops there.

If you have full control over the JSON file, note that the indented layout is purely for human consumption. Flatten each object onto a single line and let Spark parse the file the way the API is intended to be used.
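
As a rough sketch of that flattening step, assuming you cannot simply regenerate the file and that each object is itself well-formed JSON: the helper below (splitTopLevelObjects is a hypothetical name, not part of Spark) walks the text, tracks brace depth while skipping over string contents, and emits each top-level object as one compact line.

```scala
import scala.collection.mutable.ArrayBuffer

object ConcatenatedJson {
  // Hypothetical helper, not a Spark API: splits back-to-back JSON objects
  // into one compact single-line string per top-level object.
  def splitTopLevelObjects(text: String): Seq[String] = {
    val objects = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0          // nesting level of { } outside string literals
    var inString = false   // true while inside a "..." literal
    var escaped = false    // true right after a backslash inside a literal

    for (c <- text) {
      if (inString) {
        // Copy string contents verbatim, including braces and whitespace.
        current.append(c)
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else c match {
        case '"' if depth > 0 =>
          inString = true
          current.append(c)
        case '{' =>
          depth += 1
          current.append(c)
        case '}' if depth > 0 =>
          depth -= 1
          current.append(c)
          if (depth == 0) {
            // A top-level object just closed: emit it and start over.
            objects += current.toString
            current.clear()
          }
        case other if depth > 0 && !other.isWhitespace =>
          // Whitespace outside strings is insignificant in JSON, so drop it.
          current.append(other)
        case _ => () // ignore anything between top-level objects
      }
    }
    objects.toSeq
  }
}
```

The resulting strings are ordinary one-object-per-line records, so they could be written back out one per line, or turned into a Dataset[String] and passed to spark.read.json without the multiLine option (that overload is available as of Spark 2.2).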

PerkinsZhu · Answer

Keep one JSON value per line in the file and remove .option("multiLine", true), like this:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
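
For completeness, a minimal sketch of reading that one-object-per-line layout. It assumes Spark 2.2 or later on the classpath and a local master; the names JsonLinesExample and readPeople are made up for illustration, and the three records are supplied in memory so the snippet does not depend on a file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonLinesExample {
  // Hypothetical helper: parses the three sample records, one JSON object per element.
  def readPeople(spark: SparkSession): DataFrame = {
    import spark.implicits._ // provides the String encoder needed by toDS()
    val jsonLines = Seq(
      """{"name":"Michael"}""",
      """{"name":"Andy", "age":30}""",
      """{"name":"Justin", "age":19}"""
    ).toDS()
    // No multiLine option: each element is parsed as its own record.
    spark.read.json(jsonLines)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-lines-example")
      .master("local[*]") // assumption: run locally for the demo
      .getOrCreate()
    readPeople(spark).show()
    spark.stop()
  }
}
```

With a real file you would call spark.read.json(path) in the same way, again without the multiLine option.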

Tags: json, scala, apache-spark

atalpha