I am following the instructions for starting a Google Cloud Dataproc cluster with an initialization action that starts a Jupyter notebook:
https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud
How can I include extra JAR files (spark-xml, for example) in the resulting SparkContext in Jupyter notebooks (particularly pyspark)?
You can update a cluster by issuing a Dataproc API clusters.patch request, by running a gcloud dataproc clusters update command in a local terminal window or in Cloud Shell, or by editing cluster parameters from the Configuration tab of the Cluster details page in the Google Cloud console.
The answer depends slightly on which jars you're looking to load. For example, you can pull in spark-xml by setting the spark.jars.packages property when creating the cluster:
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties spark:spark.jars.packages=com.databricks:spark-xml_2.11:0.4.1
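Once the cluster is up, the Jupyter kernel set up by the initialization action provides a ready-made SparkContext/sqlContext, and the spark-xml package is already on the classpath. A minimal sketch of using it from a pyspark notebook cell (the GCS path and rowTag below are placeholders, not from the original post):

# The jupyter.sh init action provides `sc` and `sqlContext` in the notebook.
# spark-xml 0.4.1 is used through the DataFrame reader API.
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='record') \
    .load('gs://your-bucket/path/to/data.xml')  # placeholder bucket and path
df.printSchema()
df.show(5)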
To specify multiple Maven coordinates, you will need to swap the gcloud dictionary separator character from ',' to something else, since spark.jars.packages itself uses commas to separate the coordinates:
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3
Details on how to change the gcloud escaping and delimiter characters can be found with:
$ gcloud help topic escaping
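If you create the Spark session yourself (for example, in a standalone pyspark job rather than the notebook kernel provided by the init action), the same property can be set programmatically before the context is created. A minimal sketch, assuming Spark 2.x and using the same illustrative coordinate:

from pyspark.sql import SparkSession

# spark.jars.packages must be set before the SparkContext exists;
# Spark then resolves the Maven coordinates and ships the jars to executors.
spark = (SparkSession.builder
         .appName('spark-xml-example')
         .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.4.1')
         .getOrCreate())

# Confirm the property was picked up.
print(spark.sparkContext.getConf().get('spark.jars.packages'))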