 

Presto Interpreter in Zeppelin on EMR

Is it possible to add Presto interpreter to Zeppelin on AWS EMR 4.3 and if so, could someone please post the instructions? I have Presto-Sandbox and Zeppelin-Sandbox running on EMR.

shomi asked Mar 08 '16 03:03



1 Answer

There's no official Presto interpreter for Zeppelin, and the conclusion of the Jira ticket raised for it is that one isn't necessary, because you can just use the jdbc interpreter:

https://issues.apache.org/jira/browse/ZEPPELIN-27

I'm running a later EMR release with Presto and Zeppelin. The default set of interpreters doesn't include jdbc, but it can be installed by connecting to the master node over SSH and running:

sudo /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc

Even better is to use that as a bootstrap script.
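
As a rough sketch of what such a script might look like: the function name install_jdbc_interpreter is hypothetical, and the only real command is the install-interpreter.sh call from above. It assumes Zeppelin is already installed when the script runs. EMR runs bootstrap actions before it installs applications, so if the installer isn't there yet the function just reports that and returns non-zero; in that case running the same script as an EMR step may be the safer option.

```shell
#!/bin/sh
# Sketch only: install the Zeppelin jdbc interpreter if the installer exists.
# The default path is the one used in this answer; the function name is made up.
install_jdbc_interpreter() {
  _installer="${1:-/usr/lib/zeppelin/bin/install-interpreter.sh}"
  if [ ! -x "$_installer" ]; then
    # Zeppelin not installed (yet), e.g. when run too early in cluster startup.
    echo "Zeppelin installer not found at $_installer; skipping" >&2
    return 1
  fi
  sudo "$_installer" --name jdbc
}
```

Calling install_jdbc_interpreter with no arguments at the end of the script uses the default path.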

Then you can add a new interpreter in Zeppelin.

  1. Click the login-name drop down in the top right of Zeppelin
  2. Click Interpreter
  3. Click +Create

Give it a name like presto and choose jdbc as the interpreter group. With that name you put %presto as a directive on the first line of a paragraph in Zeppelin to use it, or you can set it as the default interpreter.

The settings you need here are:

default.driver com.facebook.presto.jdbc.PrestoDriver

default.url jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889

default.user hadoop

Note that no password is provided, because the EMR environment should be using IAM roles, SSH keys (ppk files etc.) and so on for authentication.

You will also need a dependency for the Presto JDBC driver jar. There are multiple ways to add dependencies in Zeppelin, but one easy way is via a Maven groupid:artifactid:version reference in the interpreter settings under Dependencies,

e.g. under artifact

com.facebook.presto:presto-jdbc:0.170

Note that version 0.170 corresponds to the version of Presto deployed on EMR at the time of writing, which will change in the future. You can see in the AWS EMR settings which version is being deployed to your cluster.

You can also get Zeppelin to connect directly to a catalog, or to a catalog and schema, by appending them to the default.url setting, as per the Presto docs for the JDBC driver: https://prestodb.io/docs/current/installation/jdbc.html

As an example, using Presto with a Hive metastore that has a database called datakeep:

jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889/hive

OR

jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889/hive/datakeep
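
If you script your cluster setup, the URL patterns above can be assembled with a small helper. This is only a sketch: presto_jdbc_url is a hypothetical function name, and the 8889 default is simply the port used in this answer.

```shell
# presto_jdbc_url HOST [PORT] [CATALOG] [SCHEMA]
# Prints a value suitable for default.url. Catalog and schema are optional;
# a schema is only appended when a catalog is also given.
presto_jdbc_url() {
  _host="$1"
  _port="${2:-8889}"
  _catalog="${3:-}"
  _schema="${4:-}"
  _url="jdbc:presto://${_host}:${_port}"
  if [ -n "$_catalog" ]; then
    _url="${_url}/${_catalog}"
    if [ -n "$_schema" ]; then
      _url="${_url}/${_schema}"
    fi
  fi
  printf '%s\n' "$_url"
}
```

For example, presto_jdbc_url "$MASTER_DNS" 8889 hive datakeep prints the catalog-and-schema form, where MASTER_DNS is a variable you would set to your own master node's DNS name.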

UPDATE Feb 2018

EMR 5.11.1 uses Presto 0.187, and there is an issue in the way the Zeppelin interpreter passes properties to the Presto driver, causing an error something like Unrecognized connection property 'url'.

Currently the only solutions appear to be using an older driver version in the artifact, or manually uploading a patched Presto driver. See https://github.com/prestodb/presto/issues/9254 and https://issues.apache.org/jira/browse/ZEPPELIN-2891

In my case, using a reference to an old driver (apparently it must be older than 0.180), e.g. com.facebook.presto:presto-jdbc:0.179, did not work: Zeppelin gave me an error about not being able to download dependencies. It's a funny error, probably something to do with Zeppelin's local Maven repo not containing that version. I'm not sure, as I gave up on that approach.

I can confirm that patching the driver works.

  • (Assuming you have Java and Maven installed)
  • Clone the Presto GitHub repo
  • Check out the release tag, e.g. git checkout 0.187
  • Make the edits as per this patch: https://groups.google.com/group/presto-users/attach/1231343dbdd09/presto-jdbc.diff?part=0.1&authuser=0
  • Build the jar using mvn clean package
  • Copy the jar to the Zeppelin machine, somewhere the zeppelin user has permission to read.
  • In the interpreter, under the Dependencies - Artifacts section, use the absolute path to that jar file instead of a Maven reference.
  • There appears to be an issue passing the user to the Presto driver, so just add it to the "default.url" JDBC connection string as a URL parameter, e.g. jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889?user=hadoop
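
The build steps above could be scripted roughly as follows. Treat it as a sketch under several assumptions rather than something I ran in this form: the function names are hypothetical, the patch is assumed to be saved locally (give an absolute path) and to apply cleanly with git apply (otherwise make the edits by hand), -DskipTests is my addition to shorten the build, and the jar is assumed to land in presto-jdbc/target/ under the usual artifact name. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# run CMD...: execute a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# build_patched_presto_jdbc VERSION PATCH_FILE DEST_DIR
# PATCH_FILE should be an absolute path, since the function changes directory.
build_patched_presto_jdbc() {
  _version="$1"
  _patch="$2"
  _dest="$3"
  run git clone https://github.com/prestodb/presto.git &&
  run cd presto &&
  run git checkout "$_version" &&
  run git apply "$_patch" &&
  run mvn clean package -DskipTests &&
  run cp "presto-jdbc/target/presto-jdbc-${_version}.jar" "$_dest/"
}
```

A call might look like build_patched_presto_jdbc 0.187 /home/hadoop/presto-jdbc.diff /home/hadoop/jars, where both paths are placeholders for locations of your own choosing.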

Up and running. Meanwhile, it might be worth considering Athena as an alternative to Presto, given that it's serverless and is effectively just a fork of Presto. It is limited to external Hive tables only, and they must be created in Athena's own catalog (or now in the AWS Glue catalog, which is also restricted to external tables).

Davos answered Nov 03 '22 03:11
