I have a multi-line JSON file and I am using Spark's read.json to read it. The problem is that it only reads the first object from the file.
val dataFrame = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path)
dataFrame.rdd.saveAsTextFile("DataFrame")
Sample JSON:
{
"_id" : "589895e123c572923e69f5e7",
"thing" : "54eb45beb5f1e061454c5bf4",
"timeline" : [
{
"reason" : "TRIP_START",
"timestamp" : "2017-02-06T17:20:18.007+02:00",
"type" : "TRIP_EVENT",
"location" : [
11.1174091,
69.1174091
],
"endLocation" : [],
"startLocation" : []
},
{
"reason" : "TRIP_END",
"timestamp" : "2017-02-06T17:25:26.026+02:00",
"type" : "TRIP_EVENT",
"location" : [
11.5691428,
48.1122443
],
"endLocation" : [],
"startLocation" : []
}
],
"__v" : 0
}
{
"_id" : "589895e123c572923e69f5e8",
"thing" : "54eb45beb5f1e032241c5bf4",
"timeline" : [
{
"reason" : "TRIP_START",
"timestamp" : "2017-02-06T17:20:18.007+02:00",
"type" : "TRIP_EVENT",
"location" : [
11.1174091,
50.1174091
],
"endLocation" : [],
"startLocation" : []
},
{
"reason" : "TRIP_END",
"timestamp" : "2017-02-06T17:25:26.026+02:00",
"type" : "TRIP_EVENT",
"location" : [
51.1174091,
69.1174091
],
"endLocation" : [],
"startLocation" : []
}
],
"__v" : 0
}
I get only the first entry, with id = 589895e123c572923e69f5e7.
Is there something that I am doing wrong?
Are you sure a file containing multiple multi-line JSON objects is supported?
Each line must contain a separate, self-contained valid JSON object... For a regular multi-line JSON file, set the multiLine option to true
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
Here a "regular JSON file" means the entire file is a single JSON object or array. However, simply putting {} around your data won't work, because you need a key for every object, so you'd need a top-level key, say "objects". Similarly, you can try an array by wrapping the data with []. Either way, these will only work if every object in that array or object is separated by commas.
tl;dr - the whole file needs to be one valid JSON value (a single object or array) when multiLine=true
You're only getting one object because Spark parses the first set of braces, and that's it.
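As a rough sketch of the array approach described above, assuming Spark 2.2 or later is on the classpath: the records are wrapped in [] and separated by a comma, so the file is one valid JSON value and multiLine mode should return one row per element. The object name MultiLineArrayDemo, the trimmed two-field records and the temporary file are illustrative assumptions, not part of the question.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object MultiLineArrayDemo {
  // Two trimmed records wrapped in one top-level array, separated by a comma.
  val wrapped: String =
    """[
      |  {
      |    "_id" : "589895e123c572923e69f5e7",
      |    "__v" : 0
      |  },
      |  {
      |    "_id" : "589895e123c572923e69f5e8",
      |    "__v" : 0
      |  }
      |]""".stripMargin

  // Writes the array to a temporary file, reads it in multiLine mode
  // and returns the _id of every row.
  def readIds(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("multiline-array-demo")
      .getOrCreate()
    try {
      val path = Files.createTempFile("trips", ".json")
      Files.write(path, wrapped.getBytes(StandardCharsets.UTF_8))
      val df = spark.read.option("multiLine", true).json(path.toString)
      df.select("_id").collect().map(_.getString(0)).toSeq
    } finally spark.stop()
  }
}
```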
If you have full control over the JSON file, the indented layout is purely for human consumption. Just flatten the objects to one per line and let Spark parse them the way the API is intended to be used.
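If the file cannot be regenerated at the source, the flattening can be sketched in plain Scala. This is a minimal sketch, not a full JSON parser: it assumes the input is well formed and contains nothing but concatenated top-level objects separated by whitespace, as in the sample. The names JsonFlatten and splitObjects are hypothetical.

```scala
object JsonFlatten {
  /** Splits text holding several concatenated top-level JSON objects into
    * one compact string per object. Brace depth and string state are
    * tracked so braces inside string literals are ignored; whitespace
    * outside string literals is dropped. */
  def splitObjects(text: String): Vector[String] = {
    val out = Vector.newBuilder[String]
    val cur = new StringBuilder
    var depth = 0
    var inString = false
    var escaped = false
    for (c <- text) {
      if (inString) {
        // Inside a string literal: copy everything, watch for the closing quote.
        cur.append(c)
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else if (c == '"') {
        inString = true
        cur.append(c)
      } else if (!c.isWhitespace) {
        cur.append(c)
        if (c == '{') depth += 1
        else if (c == '}') {
          depth -= 1
          // Depth back at zero means one top-level object is complete.
          if (depth == 0) { out += cur.toString; cur.clear() }
        }
      }
    }
    out.result()
  }
}
```

Writing JsonFlatten.splitObjects(text).mkString("\n") to a file should give one object per line, which spark.read.json(path) can then read without the multiLine option.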
Keep one JSON value per line in the file and remove .option("multiLine", true), like this:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}