
Spark-Submit: --packages vs --jars

Can someone explain the differences between --packages and --jars in a spark-submit script?

nohup ./bin/spark-submit \
  --jars ./xxx/extrajars/stanford-corenlp-3.8.0.jar,./xxx/extrajars/stanford-parser-3.8.0.jar \
  --packages datastax:spark-cassandra-connector_2.11:2.0.7 \
  --class xxx.mlserver.Application \
  --conf spark.cassandra.connection.host=192.168.0.33 \
  --conf spark.cores.max=4 \
  --master spark://192.168.0.141:7077 \
  ./xxx/xxxanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &

Also, do I require the --packages configuration if the dependency is in my application's pom.xml? (I ask because I just blew up my application by changing the version in --packages while forgetting to change it in the pom.xml.)

I am currently using --jars because the jars are massive (over 100GB) and thus slow down the shaded jar compilation. I admit I am not sure why I am using --packages, other than that I am following the datastax documentation.
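For illustration, here is a sketch of what I mean by shipping the connector via --jars instead of --packages (the Maven Central URL and jar file name below are my guess based on the coordinates above, not taken from the datastax docs):

# Download the connector jar once, then reference it locally on each submit.
# URL and file name are assumptions derived from the coordinates
# datastax:spark-cassandra-connector_2.11:2.0.7.
wget -P ./xxx/extrajars/ \
  https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.11/2.0.7/spark-cassandra-connector_2.11-2.0.7.jar

./bin/spark-submit \
  --jars ./xxx/extrajars/spark-cassandra-connector_2.11-2.0.7.jar \
  --class xxx.mlserver.Application \
  --master spark://192.168.0.141:7077 \
  ./xxx/xxxanalysis-mlserver-0.1.0.jar 1000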

asked Jul 20 '18 by Jake

People also ask

What is jar file in spark submit?

Spark JAR files let you package a project into a single file so it can be run on a Spark cluster. Many developers write Spark code in browser-based notebooks because they are unfamiliar with JAR files.

What are jars in spark?

JARs are bundles of compiled Java code files. Each library I install that internally uses Spark (or PySpark) has its own JAR files, which need to be available to both the driver and the executors so they can execute the package API calls that the user interacts with.

What happens when we do spark submit?

What happens when a Spark job is submitted? When a client submits Spark application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG).

Where do I put .JAR files in spark?

To add JARs to a Spark job, the --jars option can be used to include JARs on the Spark driver and executor classpaths. If multiple JAR files need to be included, separate them with commas. The following is an example: spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...


1 Answer

If you run spark-submit --help, it will show:

--jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.

--packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.

If it is --jars,

then Spark doesn't hit Maven; it searches for the specified jars on the local file system. It also supports the hdfs, http, https, and ftp URL schemes.
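For example, a minimal sketch (all paths, hosts, and jar names below are placeholders) mixing the URL schemes that --jars accepts:

# Comma-separated list; local paths and hdfs/http(s)/ftp URLs can be mixed.
# Every name here is a placeholder.
spark-submit \
  --class com.example.Main \
  --master spark://master-host:7077 \
  --jars /opt/libs/local-dep.jar,hdfs:///libs/shared-dep.jar,https://repo.example.com/libs/remote-dep.jar \
  app.jar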

And if it is --packages,

then Spark will search for the specified package in the local Maven repo, then Maven Central, then any repository provided by --repositories, and download it.
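For example, a minimal sketch (the coordinates and repository URL are placeholders) of resolving a dependency from Maven coordinates plus an extra repository:

# groupId:artifactId:version coordinates; --repositories adds a remote repo
# searched after the local repo and Maven Central. All names are placeholders.
spark-submit \
  --class com.example.Main \
  --master spark://master-host:7077 \
  --packages com.example:example-lib_2.11:1.2.3 \
  --repositories https://repo.example.com/maven2 \
  app.jar

Note that, unlike --jars, --packages also resolves and downloads the package's transitive dependencies.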

Now, coming back to your questions:

Also, do I require the --packages configuration if the dependency is in my application's pom.xml?

Ans: No, if you are not importing/using classes from the jar directly but need the classes loaded by some class loader or service loader (e.g. JDBC drivers); yes otherwise.
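For instance (a sketch; the PostgreSQL JDBC driver is used purely as an illustration of a service-loaded class):

# The application code never imports org.postgresql.* directly, so it
# compiles without the jar, but DriverManager must find the driver class
# at runtime -- hence the jar still has to reach the driver and executors.
spark-submit \
  --class com.example.Main \
  --master spark://master-host:7077 \
  --packages org.postgresql:postgresql:42.2.5 \
  app.jar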

BTW, if you are pinning a specific version of a specific jar in your pom.xml, why don't you make an uber/fat jar of your application, or provide the dependency jar via the --jars argument, instead of using --packages? A sketch of the fat-jar route follows.
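A minimal sketch of that route, assuming a shade plugin (or similar) is already configured in your pom.xml; the build command and output path are assumptions:

# Build one shaded jar that bundles the dependency, then submit it
# without --packages.
mvn -DskipTests package
./bin/spark-submit \
  --class xxx.mlserver.Application \
  --master spark://192.168.0.141:7077 \
  ./target/xxxanalysis-mlserver-0.1.0.jar 1000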

Links to refer to:

spark advanced-dependency-management

add-jars-to-a-spark-job-spark-submit

answered Oct 17 '22 by nomadSK25