
Reading massive JSON files into Spark Dataframe

I have a large nested NDJSON (newline-delimited JSON) file that I need to read into a single Spark dataframe and save as Parquet. In an attempt to flatten the schema I use this function:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      // Nested struct: recurse, building dotted column names
      case st: StructType => flattenSchema(st, colName)
      // Leaf field: select it by its (possibly dotted) name
      case _ => Array(col(colName))
    }
  })
}

which I apply to the dataframe returned by reading with

val df = sqlCtx.read.json(sparkContext.wholeTextFiles(path).values)

I've also switched this to val df = spark.read.json(path) so that it only handles NDJSON rather than multi-line JSON; the error is the same.
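For context, the flattened columns are then selected and written out roughly like this (outputPath is a placeholder for the real output location):

val flatDf = df.select(flattenSchema(df.schema): _*)  // select every leaf column of the nested schema
flatDf.write.parquet(outputPath)                      // outputPath: placeholder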

Either way, I get an out-of-memory error on the workers: java.lang.OutOfMemoryError: Java heap space.

I've altered the JVM memory options and the Spark executor/driver options to no avail.

Is there a way to stream the file, flatten the schema, and add to a dataframe incrementally? Some lines of the JSON contain new fields that are not present in the preceding entries... so those would need to be filled in later.
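For example, something in the spirit of supplying a schema up front so that lines missing a field simply come back as null, except that here the full set of fields isn't known ahead of time (fieldA/fieldB are made-up names):

import org.apache.spark.sql.types._

// Hypothetical schema with made-up field names; in my case the full set of
// fields only becomes known as later lines of the file are read
val knownSchema = StructType(Seq(
  StructField("fieldA", StringType, nullable = true),
  StructField("fieldB", LongType, nullable = true)
))

// Supplying a schema skips inference; lines that lack a field get null there
val dfWithSchema = spark.read.schema(knownSchema).json(path)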

Anisotropic asked Oct 29 '22 15:10
1 Answer

No workaround. The issue was with the JVM object limit. I ended up using a Scala JSON parser and building the dataframe manually.
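A minimal sketch of that approach, assuming json4s as the parser and made-up field names (id, name); the real code handled the full nested structure:

import org.apache.spark.sql.SparkSession
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Made-up record shape; in practice this mirrors the flattened set of JSON fields
case class Record(id: String, name: Option[String])

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Read the NDJSON file as plain text so each line is processed independently
// instead of inferring a schema over the whole input at once
val lines = spark.read.textFile("path/to/input.ndjson")

val records = lines.map { line =>
  implicit val formats: Formats = DefaultFormats
  val json = parse(line)
  Record(
    (json \ "id").extract[String],
    (json \ "name").extractOpt[String] // fields missing on this line become None
  )
}

records.toDF().write.parquet("path/to/output.parquet")

Because each line is parsed on its own, the whole file never has to be materialized as a single JVM object, and fields that only appear later in the file are simply None until they show up.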

Anisotropic answered Nov 09 '22 14:11