I want to use AWS Glue to convert some csv data to orc.
The ETL job I created generated the following PySpark script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "tests", table_name = "test_glue_csv", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "int", "id", "int"), ("val", "string", "val", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://glue/output"}, format = "orc", transformation_ctx = "datasink4")
job.commit()
It takes the CSV data (from the location the Athena table tests.test_glue_csv points to) and writes the output to s3://glue/output/.
How can I add some SQL manipulations to this script?
Thanks
AWS Glue Studio now provides the option to define transforms using SQL queries, allowing you to perform aggregations, easily apply filter logic to your data, add calculated fields, and more. This feature makes it easy to seamlessly mix SQL queries with AWS Glue Studio's visual transforms while authoring ETL jobs.
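For reference, when you add a SQL Query transform in Glue Studio, the generated script registers the incoming DynamicFrames as temporary views and runs spark.sql over them. The sketch below shows that pattern in plain PySpark; the helper name sparkSqlQuery, the view alias myDataSource, and the example query are illustrative, not part of your generated script:

from awsglue.dynamicframe import DynamicFrame

def sparkSqlQuery(glueContext, query, mapping, transformation_ctx):
    # Register each incoming DynamicFrame under its alias as a temp view
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    # Run the SQL and wrap the result back into a DynamicFrame
    result = glueContext.spark_session.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)

sql_node = sparkSqlQuery(glueContext,
                         query = "SELECT id, val FROM myDataSource WHERE id > 0",
                         mapping = {"myDataSource": applymapping1},
                         transformation_ctx = "sql_node")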
AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load (ETL) jobs.
You can configure your AWS Glue jobs and development endpoints to use the Data Catalog as an external Apache Hive metastore. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. AWS Glue dynamic frames integrate with the Data Catalog by default.
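A hedged sketch of that approach, assuming the job was created with the Data Catalog enabled as its Hive metastore (the "Use Glue Data Catalog as the Hive metastore" option / --enable-glue-datacatalog job parameter); the WHERE clause is just an example:

# Query the catalog table directly by database.table name
df = spark.sql("SELECT id, val FROM tests.test_glue_csv WHERE val IS NOT NULL")
df.show()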
You should first create a temporary view from your dynamic frame:
dyf.toDF().createOrReplaceTempView("view_dyf")
Here, dyf is your dynamic frame.
Then use your spark session to run SQL queries against it:
sqlDF = spark.sql("select * from view_dyf")
sqlDF.show()
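To feed the SQL result back into the rest of your generated script, convert the resulting DataFrame back into a DynamicFrame with DynamicFrame.fromDF and reuse the existing ORC sink. A minimal sketch under the same job setup as your script (the query and the frame name sql_result are illustrative):

from awsglue.dynamicframe import DynamicFrame

# Replace the plain SELECT with whatever SQL manipulation you need
sqlDF = spark.sql("SELECT id, upper(val) AS val FROM view_dyf WHERE id IS NOT NULL")

# Convert back to a DynamicFrame so the existing write_dynamic_frame sink still works
sql_result = DynamicFrame.fromDF(sqlDF, glueContext, "sql_result")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = sql_result, connection_type = "s3", connection_options = {"path": "s3://glue/output"}, format = "orc", transformation_ctx = "datasink4")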