
Spark, Alternative to Fat Jar

I know at least two ways to get my dependencies into a Spark EMR job. One is to build a fat jar, and the other is to specify which packages you want in spark-submit using the --packages option.
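For context, the two approaches look roughly like this (the class name and Maven coordinates below are made up for illustration):

```shell
# Option 1: fat jar - all dependencies are bundled into one assembly jar:
#   spark-submit --class com.example.MyJob my-job-assembly.jar
#
# Option 2: thin jar - Spark resolves the listed Maven coordinates at launch.
# Keeping the coordinate list in one variable reduces the copy-paste risk
# of a long command line.
PACKAGES="org.apache.commons:commons-lang3:3.12.0,com.typesafe:config:1.4.2"
echo "spark-submit --class com.example.MyJob --packages ${PACKAGES} my-job-thin.jar"
```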

Building the fat jar takes quite a long time to zip up (~10 minutes). Is that normal, or is it possible we have something misconfigured?

The command-line option works, but it is error prone.

Are there any alternatives? I'd like it if there already existed a way to include the dependency list in the jar with Gradle and then have it download them. Is this possible? Are there other alternatives?

Update: I'm posting a partial answer. One thing I didn't make clear in my original question is that I also care about the case where you have dependency conflicts because the same jar appears with different versions.

Update

Thank you for the responses about cutting back the number of dependencies or using provided scope where possible. For the sake of this question, let's say we already have the minimal set of dependencies necessary to run the jar.

Carlos Bribiescas asked Sep 19 '17

2 Answers

HubSpot has a (partial) solution: SlimFast. You can find an explanation at http://product.hubspot.com/blog/the-fault-in-our-jars-why-we-stopped-building-fat-jars and the code at https://github.com/HubSpot/SlimFast

Effectively it stores all the jars it'll ever need on S3, so the build doesn't package the jars at all; when the application needs to run, it fetches them from S3. So your builds are quick, and the downloads don't take long.
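The pattern it automates can be sketched like this (the bucket name, paths, and class name are illustrative, not SlimFast's actual layout):

```shell
# Build produces a thin jar plus a list of dependency jars already uploaded to S3.
# At deploy time, the host syncs only the jars it doesn't have yet, then launches
# with them on the classpath.
S3_LIBS="s3://my-bucket/libs"
LOCAL_LIBS="./libs"
# aws s3 sync "${S3_LIBS}" "${LOCAL_LIBS}"   # commented out: needs AWS credentials
echo "java -cp my-app-thin.jar:${LOCAL_LIBS}/* com.example.Main"
```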

I think if it also had the ability to shade the jars' package paths on upload, in order to avoid version conflicts, then it would be a perfect solution.

Carlos Bribiescas answered Sep 20 '22


SparkLauncher can be used if a Spark job has to be launched from another application. With SparkLauncher you configure the path to your application jar, so there is no need to create a fat jar to run the application.

With a fat jar you have to have Java installed, and launching the Spark application requires executing java -jar [your-fat-jar-here]. That's hard to automate if you want to, say, launch the application from a web application.

With SparkLauncher you're given the option of launching a Spark application from another application, e.g. the web application above. It is just much easier.

import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {

  // Configure the Spark application and start it as a child process
  val spark = new SparkLauncher()
    .setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar")
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()

  // Block until the launched Spark application exits
  spark.waitFor()
}

Code: https://github.com/phalodi/Spark-launcher

Here:

  • setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6") sets the Spark home, which is used internally to call spark-submit.

  • setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar") specifies the jar of our Spark application.

  • setMainClass("SparkApp") sets the entry point of the Spark program, i.e. the driver program.

  • setMaster("local[*]") sets the address of the master; here we run it on the local machine.

  • launch() simply starts our Spark application.

What are the benefits of SparkLauncher vs java -jar fat-jar? For more detail, see:

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-SparkLauncher.html

https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html

http://henningpetersen.com/post/22/running-apache-spark-jobs-from-applications

vaquar khan answered Sep 17 '22