PySpark (Step/Job) on EMR cannot connect to AWS Glue Data Catalog but Zeppelin can

I have set up an EMR cluster with the AWS Glue Data Catalog enabled.

I can access the data catalog when I use Zeppelin, but with jobs/steps I submit like:

aws emr add-steps --cluster-id j-XXXXXX --steps "Type=spark,Name=Test,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,2,--executor-cores,2,--executor-memory,8g,s3://XXXXXX/emr-test.py],ActionOnFailure=CONTINUE"

I cannot see my Data Catalog when I use spark.sql("USE xxx") or spark.sql("SHOW DATABASES"). Why is that? Here is the script:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession \
    .builder \
    .appName("Test") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .getOrCreate()

spark.sql("USE ...")
spark.sql("SHOW TABLES").show()
spark.sql("SELECT querydatetime FROM flights LIMIT 10").show(10)

sc.stop()

I get something like:

pyspark.sql.utils.AnalysisException: u"Database 'xxxxxx' not found;"
Jiew Meng asked Sep 17 '25 16:09


1 Answer

I found out from https://michael.ransley.co/2018/08/28/spark-glue.html that

To access the tables from within a Spark step you need to instantiate the spark session with the glue catalog:

spark = SparkSession.builder \
    .appName(job_name) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()
spark.catalog.setCurrentDatabase("mydatabase")

I was missing the line .enableHiveSupport(). It's quite unfortunate that this does not seem to be documented in the official docs ...
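For anyone debugging the same symptom, it may help to check which catalog the session is actually using. As far as I understand it, .enableHiveSupport() sets spark.sql.catalogImplementation to "hive"; without it Spark falls back to its built-in "in-memory" catalog, so the hive.metastore.client.factory.class setting is never consulted and the Glue databases do not show up. The snippet below is only a sketch of that check, not something from the linked post, and the helper names catalog_implementation and assert_hive_catalog are made up for illustration. It accepts any object with a SparkSession-style .conf.get(key, default).

```python
def catalog_implementation(spark):
    """Return the catalog backend reported by a SparkSession-like object.

    Expected values are "hive" (session built with enableHiveSupport())
    or "in-memory" (Spark's built-in catalog).
    """
    # Fall back to "in-memory" if the key is not set on the session.
    return spark.conf.get("spark.sql.catalogImplementation", "in-memory")


def assert_hive_catalog(spark):
    """Fail fast if the session would not go through the Hive metastore client."""
    impl = catalog_implementation(spark)
    if impl != "hive":
        raise RuntimeError(
            "catalog implementation is %r; build the session with "
            ".enableHiveSupport() to reach the Glue Data Catalog" % impl
        )
```

In the step script you would call assert_hive_catalog(spark) right after .getOrCreate() and before the first spark.sql(...), so a misconfigured session fails with a clear message instead of "Database ... not found". Running catalog_implementation in a Zeppelin paragraph and in the step should show whether the two sessions differ on this setting.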

Jiew Meng answered Sep 19 '25 06:09