Does Presto require a Hive metastore to read Parquet files from S3?

I am trying to generate Parquet files in S3 using Spark, with the goal that Presto can later be used to query them. Basically, this is how it looks:

Kafka-->Spark-->Parquet<--Presto

I am able to generate Parquet in S3 using Spark, and it is working fine. Now I am looking at Presto, and what I think I have found is that it needs a Hive metastore to query Parquet. I could not make Presto read my Parquet files even though Parquet stores the schema. So does it mean that, at the time of creating the Parquet files, the Spark job also has to store metadata in the Hive metastore?

If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema is changing, so to handle that I am building the schema programmatically in the Spark job and applying it while creating the Parquet files (a sketch of what I mean is below). And if I am also creating the schema in the Hive metastore, it needs to be done with this in mind.
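For context, here is a minimal PySpark sketch of what I mean by a programmatic schema. The field names, the `rows` input, and the S3 path are hypothetical placeholders; in the real job the field list is derived at runtime:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Hypothetical fields; in practice this list is built at runtime
# from whatever drives the changing schema.
fields = [("event_id", StringType()), ("ts", LongType()), ("payload", StringType())]
schema = StructType([StructField(name, dtype, True) for name, dtype in fields])

# rows would come from the Kafka-fed stream/batch in the real job
df = spark.createDataFrame(rows, schema)
df.write.mode("append").parquet("s3a://s3-bucket/path/to/table/dir")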

Or could you shed some light on whether there is a better alternative?

asked Oct 17 '22 by Dangerous Scholar

1 Answer

You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can let Spark define the tables in the metastore, or you can use Presto for that, e.g.

create table hive.default.xxx (<columns>) 
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');

(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
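If you prefer to define the table from the Spark side instead, a minimal sketch is below. It assumes Spark was built with Hive support and is configured (e.g. via hive-site.xml) to reach the same metastore Presto uses; the table name, columns, and path are placeholders:

from pyspark.sql import SparkSession

# Spark must be able to talk to the same Hive metastore as Presto.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register the existing Parquet directory as an external table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.xxx (event_id STRING, ts BIGINT, payload STRING)
    USING parquet
    LOCATION 's3a://s3-bucket/path/to/table/dir'
""")

Once the table exists in the metastore, both Spark and Presto can query the same S3 data through it.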

Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.

answered Oct 21 '22 by Piotr Findeisen