 

How to execute AWS Glue scripts using Python 2.7 from a local machine?

I have the AWS CLI and boto3 installed in my Python 2.7 environment. I want to perform various operations, such as getting schema information and database details for all the tables present in the AWS Glue console. I tried the sample script below:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
             database="records",
             table_name="recordsrecords_converted_json")
print "Count: ", persons.count()
persons.printSchema()

I got the error ImportError: No module named awsglue.transforms, which makes sense because boto3 contains no such package, as I confirmed with dir(boto3). I found that boto3 exposes the AWS service APIs through clients, which can be created with client=boto3.client('glue'). So, to get the schema information as above, I tried the sample code below:

import sys
import boto3
client=boto3.client('glue')
response = client.get_databases(
    CatalogId='string',
    NextToken='string',
    MaxResults=123
)
print client

But then I get this error: AccessDeniedException: An error occurred (AccessDeniedException) when calling the GetDatabases operation: Cross account access is not allowed.

I am pretty sure that one or both of these are valid approaches for what I am trying to do, but something isn't falling into place. Any ideas on how to get the schema and database table details from AWS Glue using Python 2.7 locally, as I tried above?

CodeHunter asked Feb 21 '18 20:02



1 Answer

The following code works for me. I am using a locally set up Zeppelin notebook connected to a Glue dev endpoint. printSchema reads the schema from the Data Catalog.

Make sure you have enabled SSH tunnelling to the dev endpoint as well.

%pyspark
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'medicaremedicare_hospital_provider_csv' table
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(database="payments", table_name="medicaremedicare_hospital_provider_csv")

# Print out information about this data
print "Count:  ", medicare_dynamicframe.count()
medicare_dynamicframe.printSchema()

You may also need to change the Spark interpreter settings in Zeppelin: tick the "Connect to existing process" option at the top, then set the host to localhost and the port to 9007.

For the second part, you need to install boto3, run aws configure, and then create the Glue client. After that, check your proxy settings in case you are behind a firewall or a company network.
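As a rough sketch of the boto3 route, the helper below walks the Data Catalog and collects column names and types for every table. It assumes credentials and a default region are already stored by aws configure; the function names list_catalog_schemas and make_glue_client are made up for illustration. Note that it does not pass CatalogId at all: the 'string' values in the question look like placeholders copied from the API reference, and a CatalogId that is not your own account ID is a likely (though unconfirmed here) cause of the "Cross account access is not allowed" error.

```python
from __future__ import print_function  # keeps print() usable on Python 2.7


def list_catalog_schemas(glue_client):
    """Return {database: {table: [(column, type), ...]}} from the Data Catalog.

    Hypothetical helper: glue_client is expected to be boto3.client('glue').
    """
    schemas = {}
    # CatalogId is deliberately omitted so the caller's own account is used.
    db_pages = glue_client.get_paginator("get_databases").paginate()
    for db_page in db_pages:
        for database in db_page["DatabaseList"]:
            db_name = database["Name"]
            tables = {}
            table_pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=db_name
            )
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    # Some tables may carry no StorageDescriptor; treat as empty.
                    columns = table.get("StorageDescriptor", {}).get("Columns", [])
                    tables[table["Name"]] = [
                        (col["Name"], col.get("Type")) for col in columns
                    ]
            schemas[db_name] = tables
    return schemas


def make_glue_client(region_name=None):
    """Create a Glue client from the credentials stored by `aws configure`."""
    import boto3  # imported lazily so the helper above has no hard dependency

    return boto3.client("glue", region_name=region_name)
```

Calling list_catalog_schemas(make_glue_client()) and printing the result should show each database with its tables and columns, provided the IAM user is allowed to call glue:GetDatabases and glue:GetTables.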

To be clear, the boto3 client is useful for all AWS client-side API calls; for server-side code such as the awsglue script, the Zeppelin approach works best.
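If only the schema of a single table is needed, a client-side approximation of printSchema() might look like the sketch below. It relies on the Glue get_table call; describe_table and format_schema are hypothetical helper names, and the output layout only loosely imitates what printSchema() prints.

```python
from __future__ import print_function  # keeps print() usable on Python 2.7


def describe_table(glue_client, database, table_name):
    """Return (column, type) pairs for one catalog table, partition keys last."""
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    partition_keys = table.get("PartitionKeys", [])
    return [(col["Name"], col.get("Type")) for col in columns + partition_keys]


def format_schema(fields):
    """Render (column, type) pairs as a small tree, one line per column."""
    lines = ["root"]
    lines.extend("|-- {0}: {1}".format(name, type_) for name, type_ in fields)
    return "\n".join(lines)
```

With the names from the question, usage would be along the lines of print(format_schema(describe_table(boto3.client('glue'), 'records', 'recordsrecords_converted_json'))).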

Hope this helps.

Yuva answered Sep 28 '22 01:09
