Use SQL inside an AWS Glue PySpark script

I want to use AWS Glue to convert some CSV data to ORC.
The ETL job I created generated the following PySpark script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "tests", table_name = "test_glue_csv", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "int", "id", "int"), ("val", "string", "val", "string")], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://glue/output"}, format = "orc", transformation_ctx = "datasink4")
job.commit()

It takes the CSV data from the location the Athena table tests.test_glue_csv points to and writes it to s3://glue/output/.

How can I add some SQL manipulations to this script?

Thanks

asked Aug 22 '17 by belostoky

People also ask

Does AWS Glue use SQL?

AWS Glue Studio now provides the option to define transforms using SQL queries, allowing you to perform aggregations, easily apply filter logic to your data, add calculated fields, and more. This feature makes it easy to seamlessly mix SQL queries with AWS Glue Studio's visual transforms while authoring ETL jobs.

Can I use PySpark in AWS Glue?

AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load (ETL) jobs.

Does Glue support Spark SQL?

You can configure your AWS Glue jobs and development endpoints to use the Data Catalog as an external Apache Hive metastore. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. AWS Glue dynamic frames integrate with the Data Catalog by default.
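
For example, a job started with the --enable-glue-datacatalog special parameter can query a catalog table directly from Spark SQL. The snippet below is only a sketch: it reuses the tests.test_glue_csv table from the question and an arbitrary WHERE clause.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Assumes the job was launched with "--enable-glue-datacatalog" so the
# Data Catalog acts as the Hive metastore for Spark SQL.
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Database and table names are the ones used in the question.
df = spark.sql("SELECT id, val FROM tests.test_glue_csv WHERE val IS NOT NULL")
df.show()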


1 Answer

You should first create a temp view/table from your dynamic frame

dyf.toDF().createOrReplaceTempView("view_dyf")

Here, dyf is your dynamic frame.

Then, use your spark object to run SQL queries on it:

sqlDF = spark.sql("select * from view_dyf")
sqlDF.show()
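
Putting this together with the generated script from the question, a minimal (untested) sketch of the SQL step could sit between DropNullFields and the ORC sink. The query and the view name view_dyf are just examples; DynamicFrame.fromDF converts the SQL result back so the existing write_dynamic_frame call still works.

from awsglue.dynamicframe import DynamicFrame

# Register the cleaned DynamicFrame as a temp view and run SQL on it.
dropnullfields3.toDF().createOrReplaceTempView("view_dyf")
sqlDF = spark.sql("SELECT id, upper(val) AS val FROM view_dyf WHERE id IS NOT NULL")

# Convert the result back to a DynamicFrame so it can feed the original ORC sink.
sql_dyf = DynamicFrame.fromDF(sqlDF, glueContext, "sql_dyf")

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = sql_dyf,
    connection_type = "s3",
    connection_options = {"path": "s3://glue/output"},
    format = "orc",
    transformation_ctx = "datasink4")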
answered Sep 19 '22 by NPCR