 

How to register S3 Parquet files in a Hive Metastore using Spark on EMR

I am using Amazon Elastic MapReduce (EMR) 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.

Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names that are registered in the Hive Metastore (e.g., looking up the foo table rather than the parquet.`s3://bucket/key/prefix/foo/parquet` style of doing things). I also want this data to persist for the lifetime of the Hive Metastore (a separate RDS instance) even if I tear down the EMR cluster and spin up a new one connected to the same Metastore.
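For concreteness, the two query styles look roughly like this (the table name and S3 path are the ones from the question; this is an illustration of the goal, not code from the original post):

sqlContext.sql("SELECT * FROM foo")  # desired: resolve the name through the Hive Metastore
sqlContext.sql("SELECT * FROM parquet.`s3://bucket/key/prefix/foo/parquet`")  # current: reference the raw S3 path every time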

Problem: if I do something like df.write.saveAsTable("foo"), that will, by default, create a managed table in the Hive Metastore (see https://spark.apache.org/docs/latest/sql-programming-guide.html). These managed tables copy the data from S3 to HDFS on the EMR cluster, which means the metadata would be useless after tearing down the EMR cluster.
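For reference, the managed-table write the question describes looks roughly like this in Spark 1.6 (a sketch of the problem, not the solution; the warehouse location defaults to HDFS on the EMR cluster):

df = sqlContext.read.parquet("s3://bucket/key/prefix/foo/parquet")
df.write.saveAsTable("foo")  # managed table: data is copied into the Hive warehouse on HDFS, not left in S3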

asked Jul 21 '16 by Sam King


2 Answers

The solution was to register the S3 file as an external table.

sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

I haven't figured out how to save a file to S3 and register it as an external table all in one shot, but createExternalTable doesn't add too much overhead.
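In practice the two steps look roughly like this (a sketch; the write options are an assumption, only createExternalTable comes from the answer itself):

# 1. write the DataFrame out to S3 as Parquet
df.write.mode("overwrite").parquet("s3://bucket/key/prefix/foo/parquet")
# 2. register that S3 location in the Hive Metastore as an external table
sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")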

answered by Sam King

The way I solve this problem is: first, create the Hive table in Spark:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("key", IntegerType(), True), StructField("value", StringType(), True)])
df = spark.catalog.createTable("data1", "s3n://XXXX-Buket/data1", schema=schema)

Next, the table created from Spark as above (in this case, data1) will appear in Hive.

In addition, from another Hive engine you can link to this data in S3 by creating an external table with the same schema as the one created in Spark:

CREATE EXTERNAL TABLE data1 (key INT, value STRING) STORED AS PARQUET LOCATION 's3n://XXXX-Buket/data1';
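Since both engines share the same Metastore, the name also resolves back in Spark; a minimal check, assuming the data1 table created above:

spark.sql("SELECT key, value FROM data1 LIMIT 10").show()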
answered by Ho Thuan