I have a large nested NDJ (new line delimited JSON) file that I need to read into a single spark dataframe and save to parquet. In an attempt to render the schema I use this function:
def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
schema.fields.flatMap(f => {
val colName = if (prefix == null) f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenSchema(st, colName)
case _ => Array(col(colName))
}
})
}
on the dataframe that is returned by reading by
val df = sqlCtx.read.json(sparkContext.wholeTextFiles(path).values)
I've also switched this to val df = spark.read.json(path)
so that this only works with NDJs and not multi-line JSON--same error.
This is causing an out of memory error on the workers
java.lang.OutOfMemoryError: Java heap space
.
I've altered the jvm memory options and spark executor/driver options to no avail
Is there a way to stream the file, flatten the schema, and add to a dataframe incrementally? Some lines of the JSON contain new fields from the preceding entires...so those would need to be filled in later.
No work around. The issue was with the JVM object limit. I ended up using a scala json parser and built the dataframe manually.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With