I am getting the following error when reading a large 6 GB single-line JSON file:
Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648
Spark does not read JSON documents that span multiple lines, hence the entire 6 GB JSON file is on a single line:
jf = sqlContext.read.json("jlrn2.json")
Configuration:
spark.driver.memory 20g
Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up.
Keep in mind that Spark is expecting each line to be a valid JSON document, not the file as a whole. Below is the relevant line from the Spark SQL Programming Guide:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
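(As an aside, Spark 2.2 and later added a multiLine option that parses an entire file as one JSON document. A minimal sketch, assuming a Spark 2.2+ SparkSession named spark:

df = spark.read.option("multiLine", "true").json("jlrn2.json")

Note, though, that the whole file is then handled as a single record, so it will not parallelize and a 6 GB document is still likely to fail; splitting the file remains the better fix.)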
So if your JSON document is in the form...
[
{ [record] },
{ [record] }
]
You'll want to change it to:
{ [record] }
{ [record] }
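A minimal sketch of that conversion, assuming the 6 GB file is a single top-level JSON array and using the third-party ijson package to stream records instead of loading the whole file into memory (the output file name is hypothetical):

import json
import ijson

with open("jlrn2.json", "rb") as src, open("jlrn2_lines.json", "w") as dst:
    # ijson.items(..., "item") yields the elements of the top-level array
    # one at a time, so memory use is bounded by the largest record.
    for record in ijson.items(src, "item"):
        # ijson parses non-integer numbers as decimal.Decimal by default;
        # default=float converts them back so json.dumps can serialize them.
        dst.write(json.dumps(record, default=float) + "\n")

After the rewrite, each line is a self-contained JSON object, and the original read should work:

jf = sqlContext.read.json("jlrn2_lines.json")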