Reading massive JSON files into Spark Dataframe

Question

I have a large nested NDJ (new line delimited JSON) file that I need to read into a single spark dataframe and save to parquet. In an attempt to render the schema I use this function:

def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
        schema.fields.flatMap(f => {
          val colName = if (prefix == null) f.name else (prefix + "." + f.name)
          f.dataType match {
            case st: StructType => flattenSchema(st, colName)
            case _ => Array(col(colName))
          }
        })
  }

on the dataframe that is returned by reading by

val df = sqlCtx.read.json(sparkContext.wholeTextFiles(path).values)

I've also switched this to val df = spark.read.json(path) so that this only works with NDJs and not multi-line JSON--same error.

This is causing an out of memory error on the workers java.lang.OutOfMemoryError: Java heap space.

I've altered the jvm memory options and spark executor/driver options to no avail

Is there a way to stream the file, flatten the schema, and add to a dataframe incrementally? Some lines of the JSON contain new fields from the preceding entires...so those would need to be filled in later.

Anisotropic · Accepted Answer

No work around. The issue was with the JVM object limit. I ended up using a scala json parser and built the dataframe manually.

Reading massive JSON files into Spark Dataframe

Tags:

json

dataframe

scala

apache-spark

Anisotropic

1 Answers

Anisotropic

Recent Activity

Donate For Us

Reading massive JSON files into Spark Dataframe

Tags:

json

dataframe

scala

apache-spark

Anisotropic

1 Answers

Anisotropic

Related questions

Recent Activity

Donate For Us