I want to use AWS Glue to convert some csv data to orc.
The ETL job I created generated the following PySpark script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "tests", table_name = "test_glue_csv", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("id", "int", "id", "int"), ("val", "string", "val", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://glue/output"}, format = "orc", transformation_ctx = "datasink4")
job.commit()
It takes the CSV data (from the location the Athena table tests.test_glue_csv points to) and writes the output to s3://glue/output/.
How can I add some SQL manipulations to this script?
Thanks
AWS Glue Studio now provides the option to define transforms using SQL queries, allowing you to perform aggregations, easily apply filter logic to your data, add calculated fields, and more. This feature makes it easy to seamlessly mix SQL queries with AWS Glue Studio's visual transforms while authoring ETL jobs.
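For reference, when you add a SQL Query transform in Glue Studio, the generated script registers the incoming DynamicFrames as temporary views and runs spark.sql over them. The sketch below shows that pattern in plain PySpark; the helper name sparkSqlQuery, the view alias myDataSource, and the example query are illustrative, not part of your generated script:

from awsglue.dynamicframe import DynamicFrame

def sparkSqlQuery(glueContext, query, mapping, transformation_ctx):
    # Register each incoming DynamicFrame under its alias as a temp view
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    # Run the SQL and wrap the result back into a DynamicFrame
    result = glueContext.spark_session.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)

sql_node = sparkSqlQuery(glueContext,
                         query = "SELECT id, val FROM myDataSource WHERE id > 0",
                         mapping = {"myDataSource": applymapping1},
                         transformation_ctx = "sql_node")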
AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load (ETL) jobs.
You can configure your AWS Glue jobs and development endpoints to use the Data Catalog as an external Apache Hive metastore. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. AWS Glue dynamic frames integrate with the Data Catalog by default.
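A hedged sketch of that approach, assuming the job was created with the Data Catalog enabled as its Hive metastore (the "Use Glue Data Catalog as the Hive metastore" option / --enable-glue-datacatalog job parameter); the WHERE clause is just an example:

# Query the catalog table directly by database.table name
df = spark.sql("SELECT id, val FROM tests.test_glue_csv WHERE val IS NOT NULL")
df.show()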
You should first create a temporary view from your dynamic frame:
dyf.toDF().createOrReplaceTempView("view_dyf")
Here, dyf is your dynamic frame.
Then use your spark session to run SQL queries against it:
sqlDF = spark.sql("select * from view_dyf")
sqlDF.show()
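To feed the SQL result back into the rest of your generated script, convert the resulting DataFrame back into a DynamicFrame with DynamicFrame.fromDF and reuse the existing ORC sink. A minimal sketch under the same job setup as your script (the query and the frame name sql_result are illustrative):

from awsglue.dynamicframe import DynamicFrame

# Replace the plain SELECT with whatever SQL manipulation you need
sqlDF = spark.sql("SELECT id, upper(val) AS val FROM view_dyf WHERE id IS NOT NULL")

# Convert back to a DynamicFrame so the existing write_dynamic_frame sink still works
sql_result = DynamicFrame.fromDF(sqlDF, glueContext, "sql_result")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = sql_result, connection_type = "s3", connection_options = {"path": "s3://glue/output"}, format = "orc", transformation_ctx = "datasink4")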