I am following the instructions for starting a Google Cloud Dataproc cluster with an initialization action that starts a Jupyter notebook:
https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud
How can I include extra JAR files (spark-xml, for example) in the resulting SparkContext in Jupyter notebooks (particularly pyspark)?
You can update a cluster by issuing a Dataproc API clusters.patch request, by running a gcloud dataproc clusters update command in a local terminal window or in Cloud Shell, or by editing cluster parameters from the Configuration tab of the Cluster details page in the Google Cloud console.
The answer depends slightly on which jars you're looking to load. For example, you can pull in spark-xml by setting the spark.jars.packages property when creating the cluster:
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties spark:spark.jars.packages=com.databricks:spark-xml_2.11:0.4.1
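Once the cluster is up, the Jupyter kernel set up by the initialization action provides a ready-made SparkContext/sqlContext, and the spark-xml package is already on the classpath. A minimal sketch of using it from a pyspark notebook cell (the GCS path and rowTag below are placeholders, not from the original post):

# The jupyter.sh init action provides `sc` and `sqlContext` in the notebook.
# spark-xml 0.4.1 is used through the DataFrame reader API.
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='record') \
    .load('gs://your-bucket/path/to/data.xml')  # placeholder bucket and path
df.printSchema()
df.show(5)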
To specify multiple Maven coordinates, you will need to swap the gcloud dictionary separator character from ',' to something else, since spark.jars.packages itself uses commas to separate the coordinates:
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3
Details on how to change the gcloud escaping and delimiter characters can be found with:
$ gcloud help topic escaping
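If you create the Spark session yourself (for example, in a standalone pyspark job rather than the notebook kernel provided by the init action), the same property can be set programmatically before the context is created. A minimal sketch, assuming Spark 2.x and using the same illustrative coordinate:

from pyspark.sql import SparkSession

# spark.jars.packages must be set before the SparkContext exists;
# Spark then resolves the Maven coordinates and ships the jars to executors.
spark = (SparkSession.builder
         .appName('spark-xml-example')
         .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.4.1')
         .getOrCreate())

# Confirm the property was picked up.
print(spark.sparkContext.getConf().get('spark.jars.packages'))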