 

How to register S3 Parquet files in a Hive Metastore using Spark on EMR

I am using Amazon Elastic MapReduce (EMR) 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.

Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names that are registered in the Hive Metastore (e.g., looking up the foo table rather than the parquet.`s3://bucket/key/prefix/foo/parquet` style of doing things). I also want this data to persist for the lifetime of the Hive Metastore (a separate RDS instance) even if I tear down the EMR cluster and spin up a new one connected to the same Metastore.
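For concreteness, the two query styles look roughly like this (the table name and S3 path are the ones from the question; this is an illustration of the goal, not code from the original post):

sqlContext.sql("SELECT * FROM foo")  # desired: resolve the name through the Hive Metastore
sqlContext.sql("SELECT * FROM parquet.`s3://bucket/key/prefix/foo/parquet`")  # current: reference the raw S3 path every time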

Problem: if I do something like df.write.saveAsTable("foo"), that will, by default, create a managed table in the Hive Metastore (see https://spark.apache.org/docs/latest/sql-programming-guide.html). These managed tables copy the data from S3 to HDFS on the EMR cluster, which means the metadata would be useless after tearing down the EMR cluster.
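For reference, the managed-table write the question describes looks roughly like this in Spark 1.6 (a sketch of the problem, not the solution; the warehouse location defaults to HDFS on the EMR cluster):

df = sqlContext.read.parquet("s3://bucket/key/prefix/foo/parquet")
df.write.saveAsTable("foo")  # managed table: data is copied into the Hive warehouse on HDFS, not left in S3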

asked Jul 21 '16 by Sam King


2 Answers

The solution was to register the S3 file as an external table.

sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

I haven't figured out how to save a file to S3 and register it as an external table all in one shot, but createExternalTable doesn't add too much overhead.
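In practice the two steps look roughly like this (a sketch; the write options are an assumption, only createExternalTable comes from the answer itself):

# 1. write the DataFrame out to S3 as Parquet
df.write.mode("overwrite").parquet("s3://bucket/key/prefix/foo/parquet")
# 2. register that S3 location in the Hive Metastore as an external table
sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")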

answered by Sam King

The way I solve this problem is: first, create the Hive table in Spark:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("key", IntegerType(), True), StructField("value", StringType(), True)])
df = spark.catalog.createTable("data1", "s3n://XXXX-Buket/data1", schema=schema)

Next, the table created from Spark as above (in this case, data1) will appear in Hive.

In addition, from another Hive engine you can link to this data in S3 by creating an external table with the same schema as the one created in Spark:

CREATE EXTERNAL TABLE data1 (key INT, value STRING) STORED AS PARQUET LOCATION 's3n://XXXX-Buket/data1';
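Since both engines share the same Metastore, the name also resolves back in Spark; a minimal check, assuming the data1 table created above:

spark.sql("SELECT key, value FROM data1 LIMIT 10").show()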
answered by Ho Thuan