I am running this code on EMR 4.6.0 with Spark 1.6.1:
import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
val inputRDD = sqlContext.read.json(input)
try {
  inputRDD.filter("`first_field` is not null OR `second_field` is not null").toJSON.coalesce(10).saveAsTextFile(output)
  logger.info("DONE!")
} catch {
  case e: Throwable => logger.error("ERROR: " + e.getMessage)
}
In the last stage of saveAsTextFile, it fails with this error:
16/07/15 08:27:45 ERROR codegen.GenerateUnsafeProjection: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0xFFFF
/* 001 */
/* 002 */ public java.lang.Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] exprs) {
/* 003 */ return new SpecificUnsafeProjection(exprs);
/* 004 */ }
(...)
What could be the reason? Thanks
I solved this problem by dropping all the unused columns from the DataFrame, i.e. by selecting only the columns I actually needed.
It turns out that Spark DataFrames cannot handle very wide schemas. There is no specific number of columns at which Spark breaks with "Constant pool has grown past JVM limit of 0xFFFF" - it depends on the kind of query - but reducing the number of columns can work around the issue.
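For example, here is a minimal sketch of that workaround applied to the code from the question (first_field and second_field come from the question; third_field is a placeholder for whichever other columns your job actually needs):

import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
val df = sqlContext.read.json(input)

// Keep only the columns the job actually needs; the narrower schema keeps
// the generated projection code below the JVM's constant pool limit.
val slimmed = df.select("first_field", "second_field", "third_field")

slimmed
  .filter("`first_field` is not null OR `second_field` is not null")
  .toJSON
  .coalesce(10)
  .saveAsTextFile(output)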
The underlying root cause is the JVM's 64KB limit on generated Java classes - see also Andrew's answer.
This is due to a known Java limitation that prevents generated classes from growing beyond 64KB.
This limitation has been worked around in SPARK-18016, which is fixed in Spark 2.3, due to be released in January 2018.
For future reference, this issue was fixed in Spark 2.3 (as Andrew noted).
If you encounter this issue on Amazon EMR, upgrade to release version 5.13 or above.