 

Presto on Amazon S3

I'm trying to use Presto on Amazon S3 bucket, but haven't found much related information on the Internet.

I've installed Presto on a micro instance but I'm not able to figure out how I could connect to S3. There is a bucket and there are files in it. I have a running Hive metastore server and I have configured it in Presto's hive.properties. But when I try to run the LOCATION command in Hive, it's not working.

It throws an error saying it cannot find the file scheme type s3.

I also don't know why we need to run Hadoop, but without Hadoop, Hive doesn't run. Is there an explanation for this?

This and this are the documentation pages I followed while setting up.

Codex asked May 09 '16 06:05


People also ask

Can Presto connect to S3?

Because of this, Presto has a lot of connectors, including to non-relational sources like the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata.

How does Presto query S3?

IAM Role: Query S3 with Presto (Recommended Approach) With this setting, the Presto server has access to all the buckets that are accessible through the IAM role the instance is bound to. The Hive metastore also needs access to those buckets and must be bound to the same IAM role.
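If an IAM role isn't an option, Presto's Hive connector also accepts explicit S3 credentials in hive.properties. A minimal sketch, assuming a recent Presto Hive connector; the key values are placeholders, not real credentials:

```properties
# Preferred: pick up the EC2 instance's IAM role automatically
hive.s3.use-instance-credentials=true

# Alternative: explicit access keys (placeholder values)
# hive.s3.aws-access-key=YOUR_ACCESS_KEY
# hive.s3.aws-secret-key=YOUR_SECRET_KEY
```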

Is Presto and Athena same?

If you're starting from scratch, you should consider Athena. It's basically serverless Presto as a service, without the headache of having to set up a lot yourself. Just point it at data and get started for $5 per terabyte scanned. There are several limitations, though.

Does Amazon Athena use Presto?

Amazon Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet. Athena can handle complex analysis, including large joins, window functions, and arrays.


1 Answer

Presto uses the Hive metastore to map database tables to their underlying files. These files can exist on S3, and can be stored in a number of formats - CSV, ORC, Parquet, SequenceFile, etc.

The Hive metastore is usually populated through HQL (Hive Query Language) by issuing DDL statements like CREATE EXTERNAL TABLE ... with a LOCATION ... clause referencing the underlying files that hold the data.
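As a sketch, a DDL statement of this shape registers S3-resident CSV files in the metastore (the bucket, path, and columns here are hypothetical; adjust the row format to your data):

```sql
-- Hypothetical table over CSV files under s3://my-bucket/logs/
CREATE EXTERNAL TABLE logs (
  event_time STRING,
  user_id    BIGINT,
  action     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/logs/';
```

On EMR the `s3://` scheme works out of the box; on a self-managed Hadoop you may need the `s3a://` scheme and the hadoop-aws jars on the classpath, which is one likely cause of the "cannot find the file scheme type s3" error in the question.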

In order to get Presto to connect to a Hive metastore you will need to edit the hive.properties file (EMR puts this in /etc/presto/conf.dist/catalog/) and set the hive.metastore.uri parameter to the thrift service of an appropriate Hive metastore service.
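A minimal hive.properties catalog file might look like the following; the metastore hostname is a placeholder, and 9083 is the conventional Thrift port:

```properties
connector.name=hive-hadoop2
# Thrift endpoint of your Hive metastore service (placeholder host)
hive.metastore.uri=thrift://metastore-host:9083
```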

The Amazon EMR cluster instances will automatically configure this for you if you select Hive and Presto, so it's a good place to start.

If you want to test this on a standalone EC2 instance, I'd suggest that you first focus on getting a functional Hive service working on top of the Hadoop infrastructure. You should be able to define tables that reside locally on the HDFS file system. Presto complements Hive but requires a functioning Hive setup; Presto's native DDL statements are not as feature-complete as Hive's, so you'll do most table creation from Hive directly.

Alternatively, you can define Presto connectors for a MySQL or PostgreSQL database, but it's just a JDBC pass-through, so I don't think you'll gain much.
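For completeness, such a connector is just another catalog file; a sketch for MySQL, with placeholder host and credentials:

```properties
# etc/catalog/mysql.properties (all values are placeholders)
connector.name=mysql
connection-url=jdbc:mysql://example-host:3306
connection-user=presto
connection-password=secret
```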

Euan answered Oct 11 '22 06:10