I'm working in an environment where I have an S3 service being used as a data lake, but not AWS Athena. I'm trying to setup Presto to be able to query the data in S3 and I know I need the define the data structure as Hive tables through the Hive Metastore service. I'm deploying each component in Docker, so I'd like to keep the container size as minimal as possible. What components from Hive do I need to be able to just run the Metastore service? I don't really actually care about running Hive, just the Metastore. Can I trim down what's needed, or is there already a pre-configured package just for that? I haven't been able to find anything online that doesn't include downloading all of Hadoop and Hive. Is what I'm trying to do possible?
Hive Configuration Properties. The username Presto will use to access the Hive metastore. The default file format used when creating new tables.
Because of this, Presto has a lot of connectors, including to non-relational sources like the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata.
Beginning in Hive 3.0, the Metastore is released as a separate package and can be run without the rest of Hive. This is referred to as standalone mode. By default the Metastore is configured for use with Hive, so a few configuration parameters have to be changed in this configuration.
Presto does not include built in support for the Hadoop file system and it will need to leverage other tools such as Hive connector (aka HCatalog). One of the strengths of presto is that it's suitable for star schema models.
There is a workaround, that you do not need hive to run presto. However I haven't tried that with any distributed file system like s3, but code suggest it should work (at least with HDFS). In my opinion it is worth trying, because you do not need any new docker image for hive at all.
The idea is to use a builtin FileHiveMetastore. It is neither documented nor advised to be used in production but you could play with it. Schema information is stored next to the data in the file system. Obviously, it has its prons and cons. I do not know the details of your use case, so I don't know if it fits your needs.
Configuration:
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=file:///tmp/hive_catalog
hive.metastore.user=cox
Demo:
presto:tiny> create schema hive.default;
CREATE SCHEMA
presto:tiny> use hive.default;
USE
presto:default> create table t (t bigint);
CREATE TABLE
presto:default> show tables;
Table
-------
t
(1 row)
Query 20180223_202609_00009_iuchi, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [1 rows, 18B] [11 rows/s, 201B/s]
presto:default> insert into t (values 1);
INSERT: 1 row
Query 20180223_202616_00010_iuchi, FINISHED, 1 node
Splits: 51 total, 51 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto:default> select * from t;
t
---
1
(1 row)
After the above I was able to find the following on my machine:
/tmp/hive_catalog/
/tmp/hive_catalog/default
/tmp/hive_catalog/default/t
/tmp/hive_catalog/default/t/.prestoPermissions
/tmp/hive_catalog/default/t/.prestoPermissions/user_cox
/tmp/hive_catalog/default/t/.prestoPermissions/.user_cox.crc
/tmp/hive_catalog/default/t/.20180223_202616_00010_iuchi_79dee041-58a3-45ce-b86c-9f14e6260278.crc
/tmp/hive_catalog/default/t/.prestoSchema
/tmp/hive_catalog/default/t/20180223_202616_00010_iuchi_79dee041-58a3-45ce-b86c-9f14e6260278
/tmp/hive_catalog/default/t/..prestoSchema.crc
/tmp/hive_catalog/default/.prestoSchema
/tmp/hive_catalog/default/..prestoSchema.crc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With