I am trying to install the Azure CosmosDB Spark connector on an HDInsight Spark cluster on Azure (GitHub). I am new to the Spark environment and I haven't found the proper way to add the connector JARs to the Spark config.
Methods I used:
Method 1: I uploaded the JARs to the Azure Blob Storage container associated with the HDInsight cluster (to example/jars/). I established an SSH connection to the Spark cluster head node and ran the following:
spark-shell --master yarn --conf "spark.executor.extraClassPath=wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar" --conf "spark.driver.extraClassPath=wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar"
The spark-shell returns the following:
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/10/19 15:10:48 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://10.0.0.20:4040
Spark context available as 'sc' (master = yarn, app id = application_1508418631913_0014).
Spark session available as 'spark'.
I think the problem here is this line:
SparkContext: Use an existing SparkContext, some configuration may not take effect.
Method 2: After uploading the JARs as in Method 1, I opened the Ambari UI and added spark.executor.extraClassPath and spark.driver.extraClassPath to Custom spark-defaults with the same values mentioned in Method 1.
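For reference, the entries I added looked like this (the jar path is the same wasb path used in Method 1):

spark.executor.extraClassPath=wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar
spark.driver.extraClassPath=wasb:///example/jars/azure-cosmosdb-spark_2.0.2_2.11-0.0.3.jar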
Neither Method 1 nor Method 2 had any effect on my development environment: when I tried to import com.microsoft.azure.cosmosdb, the interpreter couldn't find it.
Method 3: I created an HDInsight 3.6 Spark cluster (not recommended for my case, because the connector is tested on HDInsight 3.5) and added the configs to the Livy interpreter in Zeppelin. I tried the sample code found here and got this error:
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2;
After some googling I thought it was a class version problem, so I went back to HDInsight 3.5, still with no result.
My questions are:
Does spark-shell --conf apply the configuration persistently, or only for that shell session?
How can I set up a proper configuration, given that in the future I am going to use the Livy REST API to execute remote PySpark jobs that may need this package, and I don't want to redo the configuration every time I submit a remote job?
You can add extra dependencies when starting your spark-shell with:
spark-shell --packages <maven-coordinates-of-the-package>
In your case:
spark-shell --packages com.microsoft.azure:azure-cosmosdb-spark_2.1.0_2.11:1.1.2
A good practice is to package your app with all its dependencies:
https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
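For example, if you build an uber/assembly jar, you can then submit it with something like this (the class and jar names below are just placeholders):

spark-submit --master yarn --class com.example.MyApp my-app-assembly-1.0.jar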
This should work on Livy as well.
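As a sketch (the endpoint, credentials and file path below are placeholders for your cluster), you can pass the same Maven coordinates to a Livy batch through the conf field of the request body, using the standard spark.jars.packages property:

curl -k --user "admin:<password>" \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{"file": "wasb:///example/app/my_job.py", "conf": {"spark.jars.packages": "com.microsoft.azure:azure-cosmosdb-spark_2.1.0_2.11:1.1.2"}}' \
  "https://<your-cluster>.azurehdinsight.net/livy/batches"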