Spark exception handling for JSON

I am trying to catch/ignore a parsing error when reading a JSON file:

val DF = sqlContext.jsonFile("file")

There are a couple of lines that aren't valid JSON objects, but the data is too large (~1 TB) to go through individually.

I've come across exception handling for map operations using import scala.util.Try and in.map(a => Try(a.toInt)), referencing: how to handle the Exception in spark map() function?
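For reference, that pattern applied to an RDD would look something like this (a minimal sketch, assuming in is an RDD[String] as in the linked question):

import scala.util.{Try, Success}

// wrap each conversion so it yields Success(n) or Failure(e) instead of throwing
val attempts = in.map(a => Try(a.toInt))
// keep only the values that parsed successfully
val valid = attempts.collect { case Success(n) => n }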

How would I catch an exception when reading a JSON file with the function sqlContext.jsonFile?

Thanks!

asked Oct 30 '22 by Kim Ngo
1 Answer

Unfortunately you are out of luck here. DataFrameReader.json, which is used under the hood, is pretty much all-or-nothing. If your input contains malformed lines, you have to filter them out manually. A basic solution could look like this:

import scala.util.parsing.json._

val df = sqlContext.read.json(
    // keep only the lines that parse as valid JSON before handing them to read.json
    sc.textFile("file").filter(JSON.parseFull(_).isDefined)
)

Since the validation above is rather expensive, you may prefer to drop jsonFile / read.json completely and use the parsed JSON lines directly.
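If you go that route, a minimal sketch could look like this (assuming the file is newline-delimited JSON objects; the field name "id" below is purely illustrative):

import scala.util.parsing.json._

val parsed = sc.textFile("file")
  .flatMap(line => JSON.parseFull(line))   // invalid lines yield None and are dropped
  .collect { case fields: Map[String, Any] @unchecked => fields }

// work with the parsed maps directly instead of going through read.json
parsed.map(_.get("id")).take(5).foreach(println)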

answered Nov 03 '22 by zero323