
How to add an external jar to Spark in HDInsight?

I am trying to install the Azure CosmosDB Spark connector in an HDInsight Spark Cluster on Azure. (Github)

I am new to the Spark environment and I couldn't find a proper way to add the connector jars to the Spark configuration.

Methods I used:

Method 1: I uploaded the jars to the Azure Blob Storage container associated with the HDInsight cluster (under example/jars/). I established an SSH connection to the Spark cluster head node and ran the following:

spark-shell --master yarn \
  --conf "spark.executor.extraClassPath=wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar" \
  --conf "spark.driver.extraClassPath=wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar"

The spark-shell returned the following:

SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/10/19 15:10:48 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://10.0.0.20:4040
Spark context available as 'sc' (master = yarn, app id = application_1508418631913_0014).
Spark session available as 'spark'.

I think the problem here is this warning:

SparkContext: Use an existing SparkContext, some configuration may not take effect.

Method 2

After uploading the jars to example/jars/ as in the first method, I opened the Ambari UI and added spark.executor.extraClassPath and spark.driver.extraClassPath to the Custom spark2-defaults section, with the same values mentioned in Method 1.
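The two entries corresponded to properties of roughly this form (a sketch, reusing the wasb:// paths from Method 1):

    spark.executor.extraClassPath = wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar
    spark.driver.extraClassPath = wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar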

Neither Method 1 nor Method 2 had any effect on my development environment. I tried to import com.microsoft.azure.cosmosdb and the interpreter couldn't find it.

Method 3: I created an HDInsight 3.6 Spark cluster (which is not recommended for my case because the connector is tested on HDInsight 3.5) and added the configs to the Livy interpreter using Zeppelin. I tried the sample code found here and got this error:

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2;

After some googling I suspected a class version problem, so I went back to HDInsight 3.5, still with no result.

My questions are:

Does the spark-shell --conf flag apply a persistent configuration, or does it only affect that shell session?

How can I achieve a proper configuration, knowing that in the future I am going to use the Livy REST API to execute remote PySpark jobs that may include this package, and I don't want to redo the configuration each time I submit a remote job?

Asked Nov 07 '22 by Anis Tissaoui


1 Answer

You can add extra dependencies when starting your spark-shell with:

spark-shell --packages <maven-coordinates-of-the-package>

In your case:

    spark-shell --packages com.microsoft.azure:azure-cosmosdb-spark_2.1.0_2.11:1.1.2

A good practice is to package your app with all its dependencies:

https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
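If you go that route, you submit the assembled jar directly instead of listing packages at launch time; a rough sketch (the jar path and main class here are placeholders):

    # Submit an application whose assembly jar already bundles the Cosmos DB connector.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp \
      wasb:///example/jars/my-app-assembly.jar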

Either approach should work with Livy as well.
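For the Livy REST API scenario mentioned in the question, the same Maven coordinates can be passed as spark.jars.packages in the batch configuration, so nothing has to be reconfigured per submission. A minimal sketch with curl, assuming the default HDInsight Livy endpoint and a placeholder PySpark script path:

    # Submit a PySpark batch through Livy, pulling the connector from Maven at submit time.
    curl -u admin -H "Content-Type: application/json" \
      -X POST "https://<your-cluster>.azurehdinsight.net/livy/batches" \
      -d '{
            "file": "wasb:///example/app/my_job.py",
            "conf": {
              "spark.jars.packages": "com.microsoft.azure:azure-cosmosdb-spark_2.1.0_2.11:1.1.2"
            }
          }'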

Answered Nov 14 '22 by Thomas Nys