Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to add connectors to presto on Amazon EMR

I've set up a small EMR cluster with Hive/Presto installed, I want to query files on S3 and import them to Postgres on RDS.

To run queries on S3 and save the results in a table in postgres I've done the following:

  1. Started a 3 node EMR cluster from the AWS console.
  2. Manually SSH into the Master node to create an EXTERNAL table in hive, looking at an S3 bucket.
  3. Manually SSH into each of the 3 nodes and add a new catalog file:

    /etc/presto/conf.dist/catalog/postgres.properties 
    

    with the following contents

    connector.name=postgresql
    connection-url=jdbc:postgresql://ip-to-postgres:5432/database
    connection-user=<user>
    connection-password=<pass>
    

    and edited this file

    /etc/presto/conf.dist/config.properties
    

    adding

    datasources=postgresql,hive
    
  4. Restart presto by running the following manually on all 3 nodes

    sudo restart presto-server
    

This setup seems to work well.

In my application, there are multiple databases created dynamically. It seems that those configuration/catalog changes need to be made for each database and the server needs to be restarted to see the new config changes.

Is there a proper way for my application (using boto or other methods) to update configurations by

  1. Adding a new catalog file in all nodes /etc/presto/conf.dist/catalog/ for each new database
  2. Adding a new entry in all nodes in /etc/presto/conf.dist/config.properties
  3. Gracefully restarting presto across the whole cluster (ideally when it becomes idle, but that's not a major concern.
like image 360
Ger Avatar asked Oct 16 '25 20:10

Ger


1 Answers

I believe you can run a simple bash script to achieve what you want. There is no other way except creating a new cluster with --configurations parameter where you provide the desired configurations. You can run below script from the master node.

#!/bin/sh

# "cluster_nodes.txt" with private IP address of each node.
aws emr list-instances --cluster-id <cluster-id> --instance-states RUNNING | grep PrivateIpAddress | sed 's/"PrivateIpAddress"://' | sed 's/\"//g' | awk '{gsub(/^[ \t]+|[ \t]+$/,""); print;}' > cluster_nodes.txt

# For each IP connect with ssh and configure.
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Connecting $line"
    scp -i <PEM file> postgres.properties hadoop@$line:/tmp;
    ssh -i <PEM file> hadoop@$line "sudo mv /tmp/postgres.properties /etc/presto/conf/catalog;sudo chown presto:presto /etc/presto/conf/catalog/postgres.properties;sudo chmod 644 /etc/presto/conf/catalog/postgres.properties; sudo restart presto-server";
done < cluster_nodes.txt
like image 199
Root G Avatar answered Oct 19 '25 12:10

Root G



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!