 

What is the proper way of running a Spark application on YARN using Oozie (with Hue)?

I have written an application in Scala that uses Spark.
The application consists of two modules - the App module which contains classes with different logic, and the Env module which contains environment and system initialization code, as well as utility functions.
The entry point is located in Env; after initialization, it instantiates a class from App (selected according to args, using Class.forName) and executes its logic.
The modules are exported into two separate JARs (namely, env.jar and app.jar).
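For reference, the entry point looks roughly like this (a minimal sketch; the Task trait and surrounding structure are hypothetical, only the Class.forName dispatch and the args-based selection come from the description above):

    // Hypothetical trait implemented by the task classes in the App module
    trait Task { def run(): Unit }

    object Main {
      def main(args: Array[String]): Unit = {
        // ... environment and system initialization (Env module) ...

        // args(0) names the task class to run, e.g. "app.AggBlock1Task"
        val task = Class.forName(args(0)).newInstance().asInstanceOf[Task]
        task.run()
      }
    }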

When I run the application locally, it executes well. The next step is to deploy the application to my servers. I use Cloudera's CDH 5.4.

I used Hue to create a new Oozie workflow with a Spark task with the following parameters:

  • Spark Master: yarn
  • Mode: cluster
  • App name: myApp
  • Jars/py files: lib/env.jar,lib/app.jar
  • Main class: env.Main (in Env module)
  • Arguments: app.AggBlock1Task

I then placed the two JARs inside the lib folder of the workflow's directory (/user/hue/oozie/workspaces/hue-oozie-1439807802.48).
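For context, the workflow.xml that Hue generates for such a Spark action should look roughly like this (a sketch based on the uri:oozie:spark-action:0.1 schema, not the exact generated file; the values are the parameters listed above):

    <action name="spark-task">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>myApp</name>
            <class>env.Main</class>
            <jar>lib/env.jar,lib/app.jar</jar>
            <arg>app.AggBlock1Task</arg>
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>

Note that both JARs end up comma-separated inside the single <jar> element, which lines up with the exception below: the whole string is resolved as one file path.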

When I run the workflow, it throws a FileNotFoundException and the application does not execute:

java.io.FileNotFoundException: File file:/cloudera/yarn/nm/usercache/danny/appcache/application_1439823995861_0029/container_1439823995861_0029_01_000001/lib/app.jar,lib/env.jar does not exist

However, when I leave the Spark master and mode parameters empty, everything runs properly, but spark.master is then set to local[*] rather than yarn when I check it programmatically. Also, in the logs I found the following under the Oozie Spark action configuration:

--master
null
--name
myApp
--class
env.Main
--verbose
lib/env.jar,lib/app.jar
app.AggBlock1Task

I assume I'm not doing it right - not setting the Spark master and mode parameters and running the application with spark.master set to local[*] cannot be the proper way. As far as I understand, creating a SparkConf object within the application should pick up the spark.master property from whatever I specify in Oozie (in this case yarn), but it just doesn't work when I do that.
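For reference, the SparkConf in Env is created roughly like this (a minimal sketch; the point is that no setMaster call is made, so spark.master is expected to be injected by whatever submits the job):

    import org.apache.spark.{SparkConf, SparkContext}

    // No .setMaster() here - spark-submit (or Oozie's Spark action) is expected
    // to supply --master, which SparkConf picks up from the system properties.
    val conf = new SparkConf().setAppName("myApp")
    val sc = new SparkContext(conf)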

Is there something I'm doing wrong or missing?
Any help will be much appreciated!

asked by Danny Goodwill


1 Answer

I managed to solve the problem by putting the two JARs in the user directory /user/danny/app/ and specifying the Jar/py files parameter as ${nameNode}/user/danny/app/env.jar. Running it then caused a ClassNotFoundException to be thrown, even though app.jar was located in the same HDFS folder. To work around that, I went to the settings and added the following to the options list: --jars ${nameNode}/user/danny/app/app.jar. This way the App module is referenced as well, and the application runs successfully.
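In workflow.xml terms, the working configuration ends up looking roughly like this (a sketch; compared to the original setup, only the <jar> and <spark-opts> elements change):

    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <mode>cluster</mode>
        <name>myApp</name>
        <class>env.Main</class>
        <jar>${nameNode}/user/danny/app/env.jar</jar>
        <spark-opts>--jars ${nameNode}/user/danny/app/app.jar</spark-opts>
        <arg>app.AggBlock1Task</arg>
    </spark>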

answered by Danny Goodwill