I know at least two ways to get my dependencies into a Spark EMR job. One is to create a fat jar; the other is to specify the packages you want in spark-submit using the --packages option.
The fat jar takes quite a long time to zip up, about 10 minutes. Is that normal? Is it possible that we have it incorrectly configured?
The command-line option is fine, but error-prone.
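To make the error-prone part concrete, here is a minimal sketch of generating the spark-submit arguments from one list of dependencies instead of typing them by hand. SubmitCommand, Coordinate and the com.example coordinates are hypothetical names for illustration; only the spark-submit flags (--class, --master, --packages) come from Spark itself.

```scala
// Hypothetical helper; SubmitCommand and Coordinate are illustrative names, not Spark APIs.
object SubmitCommand {
  // A Maven coordinate in the groupId:artifactId:version form that --packages expects.
  final case class Coordinate(group: String, artifact: String, version: String) {
    override def toString: String = s"$group:$artifact:$version"
  }

  // Build the spark-submit argument list; --packages takes one comma-separated value.
  def build(appJar: String, mainClass: String, master: String, deps: Seq[Coordinate]): Seq[String] = {
    // Refuse to continue if the same artifact is listed with two different versions.
    val conflicts = deps.distinct
      .groupBy(d => s"${d.group}:${d.artifact}")
      .collect { case (name, ds) if ds.size > 1 => name }
    val names = conflicts.mkString(", ")
    require(conflicts.isEmpty, s"conflicting versions for: $names")

    val packages =
      if (deps.isEmpty) Seq.empty[String]
      else Seq("--packages", deps.distinct.mkString(","))
    // Options must come before the application jar.
    (Seq("spark-submit", "--class", mainClass, "--master", master) ++ packages) :+ appJar
  }
}
```

The resulting Seq[String] could be handed to scala.sys.process or printed into a deploy script. It does not remove the need for --packages, it only keeps the list in one place.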
Are there any alternatives? I'd like a way (ideally one that already exists) to include the dependency list in the jar with Gradle and then have the dependencies downloaded automatically. Is this possible?
Update: I'm posting a partial answer. One thing I didn't make clear in my original question is that I also care about dependency conflicts, where the same jar is pulled in with different versions.
Update
Thank you for the responses about cutting back the number of dependencies or using the provided scope where possible. For the sake of this question, let's say we have the minimal number of dependencies necessary to run the jar.
HubSpot has a (partial) solution: SlimFast. You can find an explanation here http://product.hubspot.com/blog/the-fault-in-our-jars-why-we-stopped-building-fat-jars and you can find the code here https://github.com/HubSpot/SlimFast
Effectively, it stores all the jars it will ever need on S3. The build runs without packaging the jars, and when the application needs to run it gets them from S3. So your builds are quick, and downloads don't take long.
I think if this also had the ability to shade the jar's paths on upload, in order to avoid conflicts, then it would be a perfect solution.
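As a rough illustration of the run-time half of that idea (this is not SlimFast's code, and the names and cache layout are assumptions of mine), the launcher only has to compare a manifest of jar names against a local cache and fetch whatever is missing. The actual S3 download is left out here:

```scala
import java.nio.file.{Files, Path}

// Illustrative only: not SlimFast's implementation. JarCache and plan are made-up names.
object JarCache {
  // Split the manifest into jars already in the cache and jar names still to be fetched.
  def plan(manifest: Seq[String], cacheDir: Path): (Seq[Path], Seq[String]) = {
    val (present, missing) =
      manifest.distinct.partition(name => Files.isRegularFile(cacheDir.resolve(name)))
    (present.map(name => cacheDir.resolve(name)), missing)
  }
}
```

Only the names in the second element would need a download, which is why repeated runs on the same machine stay cheap.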
SparkLauncher can be used if the Spark job has to be launched from another application. With SparkLauncher you can configure your jar path, so there is no need to create a fat jar to run the application.
With a fat jar, you have to have Java installed, and launching the Spark application requires executing java -jar [your-fat-jar-here]. That is hard to automate if you want to, say, launch the application from a web application.
With SparkLauncher you're given the option of launching a Spark application from another application, e.g. the web application above. It is just much easier.
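import org.apache.spark.launcher.SparkLauncher

// Named Launcher here so it does not clash with the imported SparkLauncher class.
object Launcher extends App {
  // launch() starts spark-submit as a child process and returns a java.lang.Process.
  val spark = new SparkLauncher()
    .setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar")
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()
  // Block until the Spark application exits.
  spark.waitFor()
}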
Code: https://github.com/phalodi/Spark-launcher
Here:
.setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6") sets the Spark home, which is used internally to call spark-submit.
.setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar") specifies the jar of our Spark application.
.setMainClass("SparkApp") sets the entry point of the Spark program, i.e. the driver program.
.setMaster("local[*]") sets the address of the master; here we run it on the local machine.
.launch() simply starts our Spark application.
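The example above still points at an assembly jar. A sketch of the thin-jar variant, assuming the spark-launcher artifact is on the classpath, could use addJar to ship the dependency jars next to a small application jar. ThinJarLauncher, configure and all paths passed to it are placeholders of mine, not part of the linked repository:

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch only: the object name and every path passed in are placeholders.
object ThinJarLauncher {
  // Configure a launcher for a thin application jar plus separately shipped dependency jars.
  def configure(sparkHome: String, appJar: String, mainClass: String,
                dependencyJars: Seq[String]): SparkLauncher = {
    val base = new SparkLauncher()
      .setSparkHome(sparkHome)
      .setAppResource(appJar)
      .setMainClass(mainClass)
      .setMaster("local[*]")
    // addJar submits each jar with the application, like --jars on spark-submit.
    dependencyJars.foldLeft(base)((launcher, jar) => launcher.addJar(jar))
  }
}
```

Calling ThinJarLauncher.configure(...).launch().waitFor() would then start the job the same way as the example above, without building an assembly first.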
What are the benefits of SparkLauncher vs java -jar fat-jar?
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-SparkLauncher.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html
http://henningpetersen.com/post/22/running-apache-spark-jobs-from-applications