 

read.json only reading the first object in Spark

I have a multi-line JSON file, and I am using Spark's read.json to read it. The problem is that it only reads the first object from that JSON file.

val dataFrame = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path)
dataFrame.rdd.saveAsTextFile("DataFrame")

Sample JSON:

{
    "_id" : "589895e123c572923e69f5e7",
    "thing" : "54eb45beb5f1e061454c5bf4",
    "timeline" : [ 
        {
            "reason" : "TRIP_START",
            "timestamp" : "2017-02-06T17:20:18.007+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                11.1174091, 
                69.1174091
            ],
            "endLocation" : [],
            "startLocation" : []
        }, 
        {
            "reason" : "TRIP_END",
            "timestamp" : "2017-02-06T17:25:26.026+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                11.5691428, 
                48.1122443
            ],
            "endLocation" : [],
            "startLocation" : []
        }
    ],
    "__v" : 0
}
{
    "_id" : "589895e123c572923e69f5e8",
    "thing" : "54eb45beb5f1e032241c5bf4",
    "timeline" : [ 
        {
            "reason" : "TRIP_START",
            "timestamp" : "2017-02-06T17:20:18.007+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                11.1174091, 
                50.1174091
            ],
            "endLocation" : [],
            "startLocation" : []
        }, 
        {
            "reason" : "TRIP_END",
            "timestamp" : "2017-02-06T17:25:26.026+02:00",
            "type" : "TRIP_EVENT",
            "location" : [ 
                51.1174091, 
                69.1174091
            ],
            "endLocation" : [],
            "startLocation" : []
        }
    ],
    "__v" : 0
}

I get only the first entry with id = 589895e123c572923e69f5e7.

Is there something that I am doing wrong?

asked Mar 07 '23 by atalpha


2 Answers

Are you sure multiple multi-line JSON objects in a single file are supported?

Each line must contain a separate, self-contained valid JSON object... For a regular multi-line JSON file, set the multiLine option to true

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

Here a "regular JSON file" means the entire file is a single JSON object or array. Simply putting {} around your data won't work, because every value inside an object needs a key, so you'd need a top-level key, say "objects", holding your records. Alternatively, you can wrap the data in an array with []. Either way, this only works if the objects inside that array or object are separated by commas.

tl;dr - the whole file needs to be one valid JSON document when multiLine=true

You're only getting one object because it parses the first set of brackets, and that's it.
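
For illustration, a minimal sketch of the array-wrapping approach (the file name trips-array.json and its wrapped contents are assumptions for this example, not something from the question):

// Assumes an existing SparkSession named `spark`, as in the question, and a
// hypothetical file trips-array.json containing the two objects from the
// question wrapped in one top-level [ ... ] array, separated by a comma, so
// that the whole file is a single valid JSON document.
val df = spark.read
  .option("multiLine", true)
  .json("trips-array.json")

// With multiLine=true, Spark parses the whole file and yields one row per
// element of the top-level array, so both _id values should appear.
df.select("_id", "thing").show(false)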

If you have full control over the JSON file, remember that the indented layout is purely for human consumption. Just flatten each object onto its own line and let Spark parse the file the way the API is intended to be used.

answered Mar 10 '23 by OneCricketeer


Keep one JSON value per line in the file and remove .option("multiLine", true), like this:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
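
A rough sketch of reading such a JSON Lines file (the file name people.json is an assumed example path); no multiLine option is needed, because Spark's JSON reader expects one JSON document per line by default:

// Assumes an existing SparkSession named `spark` and a hypothetical file
// people.json containing exactly the three lines shown above.
val people = spark.read.json("people.json")

people.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

people.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+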


answered Mar 10 '23 by PerkinsZhu