
Can't get a SparkContext in new AWS EMR Cluster

I just set up an AWS EMR cluster (EMR version 5.18 with Spark 2.3.2). When I SSH into the master machine and run spark-shell or pyspark, I get the following error:

$ spark-shell

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /stderr (Permission denied)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:223)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
        at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
        at org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:120)
        at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:108)
        at org.apache.spark.deploy.SparkSubmit$.initializeLogIfNecessary(SparkSubmit.scala:71)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA-stderr].
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /stdout (Permission denied)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:223)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
        at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
        at org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:120)
        at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:108)
        at org.apache.spark.deploy.SparkSubmit$.initializeLogIfNecessary(SparkSubmit.scala:71)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA-stdout].
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/11/04 12:24:32 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (4608+460 MB) is above the max threshold (3072 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
        at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:318)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:166)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103)
        at $line3.$read$$iw$$iw.<init>(<console>:15)
        at $line3.$read$$iw.<init>(<console>:43)
        at $line3.$read.<init>(<console>:45)
        at $line3.$read$.<init>(<console>:49)
        at $line3.$read$.<clinit>(<console>)
        at $line3.$eval$.$print$lzycompute(<console>:7)
        at $line3.$eval$.$print(<console>:6)
        at $line3.$eval.$print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
        at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
        at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
        at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
        at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
        at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2.apply(SparkILoop.scala:79)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2.apply(SparkILoop.scala:79)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SparkILoop.scala:79)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1$$anonfun$apply$mcV$sp$1.apply(SparkILoop.scala:79)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1$$anonfun$apply$mcV$sp$1.apply(SparkILoop.scala:79)
        at scala.tools.nsc.interpreter.ILoop.savingReplayStack(ILoop.scala:91)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:78)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:78)
        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:78)
        at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
        at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:77)
        at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:110)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
        at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
        at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
        at org.apache.spark.repl.Main$.doMain(Main.scala:76)
        at org.apache.spark.repl.Main$.main(Main.scala:56)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/04 12:24:33 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
18/11/04 12:24:33 WARN MetricsSystem: Stopping a MetricsSystem that is not running
java.lang.IllegalArgumentException: Required executor memory (4608+460 MB) is above the max threshold (3072 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
  at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:318)
  at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:166)
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
  at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103)
  ... 55 elided
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

I am new to Spark and to EMR and don't know what to do. Is there some configuration step I missed, or anything else I have to provide to make it work?

Thank you for your help!

Daniel Dev asked Nov 04 '18



2 Answers

If you look into the /etc/spark/conf/log4j.properties file, you'll find a new setup that rolls Spark Streaming logs hourly (probably as suggested here).

The problem occurs because the ${spark.yarn.app.container.log.dir} system property is not set in the Spark driver process. The property is eventually set to YARN's container log directory, but that happens later (look here and here).
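To illustrate, the DRFA-stderr and DRFA-stdout appender entries in /etc/spark/conf/log4j.properties presumably look roughly like this (reconstructed from the appender names and errors above, not the literal EMR file; the exact keys may differ):

    # Hypothetical reconstruction of the EMR 5.18 appender setup
    log4j.appender.DRFA-stderr=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.DRFA-stderr.DatePattern=.yyyy-MM-dd-HH
    # With ${spark.yarn.app.container.log.dir} unset, this resolves to "/stderr"
    log4j.appender.DRFA-stderr.File=${spark.yarn.app.container.log.dir}/stderr
    log4j.appender.DRFA-stdout=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.DRFA-stdout.DatePattern=.yyyy-MM-dd-HH
    log4j.appender.DRFA-stdout.File=${spark.yarn.app.container.log.dir}/stdout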

To fix this error in the Spark driver, add the following to your spark-submit or spark-shell command: --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
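For example (hypothetical invocations; the only required part is the --driver-java-options flag above, and my_job.py is just a placeholder):

    $ spark-shell --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
    $ spark-submit --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop' my_job.py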

Note that the /mnt/var/log/hadoop/stderr and /mnt/var/log/hadoop/stdout files will be reused by all the (Spark Streaming) processes started on the same node.

Michael Spector answered Nov 16 '22


We have also run into this issue and hope some AWS or Spark engineers are reading this. I've narrowed it down to the /etc/spark/conf/log4j.properties file and the way the loggers are configured using the ${spark.yarn.app.container.log.dir} system property. That value evaluates to null, so the logging directory now resolves to /stdout and /stderr instead of the desired /mnt/var/log/hadoop-yarn/containers/<app_id>/<container_id>/(stdout|stderr), which is how it worked in EMR < 5.18.0.

Workaround #1 (not ideal): If you set that property to a static path that the hadoop user has access to, like /var/log/hadoop-yarn/stderr, things work fine. This probably breaks things like the history server and an unknown number of other things, but spark-shell and pyspark can start without errors.
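A rough sketch of that workaround, assuming the appenders are named DRFA-stderr/DRFA-stdout as in the errors above (the exact keys in the shipped file may differ):

    # In /etc/spark/conf/log4j.properties, replace the unresolved variable with a static,
    # hadoop-writable location (an illustration, not the verbatim EMR file)
    log4j.appender.DRFA-stderr.File=/var/log/hadoop-yarn/stderr
    log4j.appender.DRFA-stdout.File=/var/log/hadoop-yarn/stdout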

UPDATE Workaround #2 (revert): Not sure why I didn't do this earlier, but comparing this to a 5.13 cluster, the DRFA-stderr and DRFA-stdout appenders did not exist at all. If you comment those sections out, delete them, or simply copy the log4j.properties file from the template, this problem also goes away (again, with unknown impact on the rest of the services). I'm not sure where that section originated; the master repo configs do not have those appenders, so it appears to be proprietary to the AWS distros.
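A sketch of the revert, assuming a stock template ships alongside the active file (paths are typical EMR locations and may vary; back up before overwriting):

    $ sudo cp /etc/spark/conf/log4j.properties /etc/spark/conf/log4j.properties.bak
    # Either comment out / delete the DRFA-stderr and DRFA-stdout sections in an editor,
    # or restore the stock template if it is present:
    $ sudo cp /etc/spark/conf/log4j.properties.template /etc/spark/conf/log4j.properties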

Diffuser answered Nov 16 '22