How to run Spark Scala code on Amazon EMR

I am trying to run the following piece of Spark code written in Scala on Amazon EMR:

import org.apache.spark.{SparkConf, SparkContext}

object TestRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello World")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    println(wordCounts)
  }
}

This is the script I am using to deploy the above code into EMR:

#!/usr/bin/env bash

set -euxo pipefail

cluster_id='j-XXXXXXXXXX'

app_name="HelloWorld"

main_class="TestRunner"
jar_name="HelloWorld-assembly-0.0.1-SNAPSHOT.jar"
jar_path="target/scala-2.11/${jar_name}"
s3_jar_dir="s3://jars/"
s3_jar_path="${s3_jar_dir}${jar_name}"
###################################################

sbt assembly

aws s3 cp ${jar_path} ${s3_jar_dir}

aws emr add-steps --cluster-id ${cluster_id} --steps Type=spark,Name=${app_name},Args=[--deploy-mode,cluster,--master,yarn-cluster,--class,${main_class},${s3_jar_path}],ActionOnFailure=CONTINUE

But this exits after a few minutes without producing any output at all in AWS!

Here's my controller's output:

2016-10-20T21:03:17.043Z INFO Ensure step 3 jar file command-runner.jar
2016-10-20T21:03:17.043Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/mscheiber/HelloWorld-assembly-0.0.1-SNAPSHOT.jar'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-east-1
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  HOSTNAME=ip-10-17-186-102
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3
INFO ProcessRunner started child process 24549 :
hadoop    24549   4780  0 21:03 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/TestRunner-assembly-0.0.1-SNAPSHOT.jar
2016-10-20T21:03:21.050Z INFO HadoopJarStepRunner.Runner: startRun() called for s-3UAS8JQ0KEOV3 Child Pid: 24549
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 44 seconds
2016-10-20T21:04:03.102Z INFO Step created jobs: 
2016-10-20T21:04:03.103Z INFO Step succeeded with exitCode 0 and took 44 seconds

The syslog and stdout are empty, and this is in my stderr:

16/10/20 21:03:20 INFO RMProxy: Connecting to ResourceManager at ip-10-17-186-102.ec2.internal/10.17.186.102:8032
16/10/20 21:03:21 INFO Client: Requesting a new application from cluster with 2 NodeManagers
16/10/20 21:03:21 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (53248 MB per container)
16/10/20 21:03:21 INFO Client: Will allocate AM container, with 53247 MB memory including 4840 MB overhead
16/10/20 21:03:21 INFO Client: Setting up container launch context for our AM
16/10/20 21:03:21 INFO Client: Setting up the launch environment for our AM container
16/10/20 21:03:21 INFO Client: Preparing resources for our AM container
16/10/20 21:03:21 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/10/20 21:03:22 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_libs__5484581201997889110.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_libs__5484581201997889110.zip
16/10/20 21:03:24 INFO Client: Uploading resource s3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/DataScience-assembly-0.0.1-SNAPSHOT.jar
16/10/20 21:03:24 INFO S3NativeFileSystem: Opening 's3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar' for reading
16/10/20 21:03:26 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_conf__5724047842379101980.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_conf__.zip
16/10/20 21:03:26 INFO SecurityManager: Changing view acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing view acls groups to: 
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls groups to: 
16/10/20 21:03:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
16/10/20 21:03:26 INFO Client: Submitting application application_1476995377469_0002 to ResourceManager
16/10/20 21:03:26 INFO YarnClientImpl: Submitted application application_1476995377469_0002
16/10/20 21:03:27 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:27 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1476997406896
     final status: UNDEFINED
     tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
     user: hadoop
16/10/20 21:03:28 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:29 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:30 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:31 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:31 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.17.181.184
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1476997406896
     final status: UNDEFINED
     tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
     user: hadoop
16/10/20 21:03:32 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:33 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:34 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:35 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:36 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:37 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:38 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:39 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:40 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:41 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:42 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:43 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:44 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:45 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:46 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:47 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:48 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:49 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:50 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:51 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:52 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:53 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:54 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:55 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:56 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:57 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:58 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:59 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:04:00 INFO Client: Application report for application_1476995377469_0002 (state: FINISHED)
16/10/20 21:04:00 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.17.181.184
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1476997406896
     final status: SUCCEEDED
     tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
     user: hadoop
16/10/20 21:04:00 INFO Client: Deleting staging directory hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002
16/10/20 21:04:00 INFO ShutdownHookManager: Shutdown hook called
16/10/20 21:04:00 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4
Command exiting with ret '0'

What am I missing?

asked Oct 20 '16 by pathikrit


1 Answer

Looks like your application succeeded just fine. However, there are two reasons why you don't see any output in the step's stdout logs.

1) You ran the application in yarn-cluster mode, which means that the driver runs on a random cluster node rather than on the master node. If you specified an S3 log URI when creating the cluster, you should see the logs for this application in the containers directory of your S3 bucket. The driver's logs will be in container #0's logs.
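
If you just want to read the driver's output quickly, you can also pull the aggregated YARN container logs for the application from the master node. A minimal sketch (the application ID is the one from your stderr output; the S3 bucket name is only an example, and the S3 path assumes you set a log URI when creating the cluster):

# On the EMR master node, assuming YARN log aggregation is enabled
# (the default on recent EMR releases): dump all container logs,
# including the driver's stdout/stderr.
yarn logs -applicationId application_1476995377469_0002 | less

# Or browse the logs pushed to S3 (example bucket name; EMR writes
# per-container logs under <log-uri>/<cluster-id>/containers/<application-id>/):
aws s3 ls --recursive s3://my-emr-logs/j-XXXXXXXXXX/containers/application_1476995377469_0002/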

2) You did not call anything like collect() to bring the data from the Spark executors back to the driver, so your println() at the end does not print the data; it prints the toString() representation of the RDD. You probably want to do something like wordCounts.collect().foreach(println) instead, as in the sketch below.
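
For completeness, a minimal sketch of the corrected driver code (same object and app name as in your question; stopping the SparkContext at the end is just good practice):

import org.apache.spark.{SparkConf, SparkContext}

object TestRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello World")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // collect() brings the results back to the driver, so println prints the
    // actual (word, count) pairs in the driver's stdout (container #0's logs
    // in cluster deploy mode) instead of the RDD's toString().
    wordCounts.collect().foreach(println)

    sc.stop()
  }
}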

answered Nov 05 '22 by Jonathan Kelly