
Setting spark classpaths on EC2: spark.driver.extraClassPath and spark.executor.extraClassPath

Reducing the size of the application jar by providing a Spark classpath for the Maven dependencies:

My cluster has 3 EC2 instances running Hadoop and Spark. If I build the jar with its Maven dependencies included, it becomes too large (around 100 MB). I want to avoid this because the jar is replicated to all nodes each time I run the job.

To avoid that, I built the package with "mvn package". For dependency resolution, I downloaded all the Maven dependencies on each node and then provided only the jar paths below.

I added the class paths on each node in "spark-defaults.conf" as:

spark.driver.extraClassPath        /home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.5/cassandra-driver-core-2.1.5.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar:/home/spark/.m2/repository/com/google/collections/google-collections/1.0/google-collections-1.0.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector-java_2.10/1.2.0-rc1/spark-cassandra-connector-java_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.0-rc1/spark-cassandra-connector_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/org/apache/cassandra/cassandra-thrift/2.1.3/cassandra-thrift-2.1.3.jar:/home/spark/.m2/repository/org/joda/joda-convert/1.2/joda-convert-1.2.jar

It worked locally on a single node, but I am still getting this error. Any help will be appreciated.

Abhinandan Satpute asked Jul 29 '15 13:07




1 Answer

Finally, I was able to solve the problem. I created the application jar using "mvn package" instead of "mvn clean compile assembly:single", so that the Maven dependencies are not bundled into the jar (they must then be provided at run time). This results in a small jar, because it only references the dependencies instead of containing them.

Then I added the two parameters below to spark-defaults.conf on each node:

spark.driver.extraClassPath     /home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.7/cassandra-driver-core-2.1.7.jar:/home/spark/.m2/repository/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.jar:/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar

spark.executor.extraClassPath     /home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.7/cassandra-driver-core-2.1.7.jar:/home/spark/.m2/repository/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.jar:/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar
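Because both properties carry the same long value, it is easy for the two lines to drift apart when they are edited by hand. The following is only a minimal sketch of generating them from one list of jar paths; the helper name extra_classpath_lines is hypothetical, and the two jar paths are taken from the settings above as a shortened example.

```python
def extra_classpath_lines(jar_paths):
    """Build the two spark-defaults.conf lines from one list of jar paths.

    Hypothetical helper for illustration; not part of Spark.
    """
    # The value is an ordinary JVM classpath, so on Linux the entries
    # are joined with ":".
    classpath = ":".join(jar_paths)
    return [
        "spark.driver.extraClassPath     " + classpath,
        "spark.executor.extraClassPath     " + classpath,
    ]


# Shortened example list; a real setup would list every required jar.
jars = [
    "/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar",
    "/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar",
]

for line in extra_classpath_lines(jars):
    print(line)
```

The printed lines can then be pasted into spark-defaults.conf on each node.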

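So the question arises: how will the application jar get its Maven dependencies (the required jars) at run time?

For that, I downloaded all the required dependencies on each node in advance by running mvn clean compile assembly:single there, which places them in the local Maven repository (/home/spark/.m2/repository) that the class paths above point to.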
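Since this approach depends on every jar being present at the same path on every node, a quick check on each node can catch a missing download before a job fails. This is a rough sketch under that assumption; the function name missing_classpath_entries is hypothetical, and the classpath string is a shortened example built from the paths above.

```python
import os


def missing_classpath_entries(classpath):
    """Return the classpath entries that are not existing files on this machine.

    Hypothetical helper for illustration; not part of Spark.
    """
    # Split the colon-separated classpath and ignore empty entries.
    entries = [entry for entry in classpath.split(":") if entry]
    return [entry for entry in entries if not os.path.isfile(entry)]


# Shortened example value; a real check would use the full value from
# spark-defaults.conf.
classpath = (
    "/home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar"
    ":/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar"
)

for path in missing_classpath_entries(classpath):
    print("missing on this node:", path)
```

Run on each of the nodes, it prints nothing when all listed jars are in place.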

Abhinandan Satpute answered Sep 24 '22 19:09