 

AWS Glue: ETL to read S3 CSV files

I want to use an ETL job to read data from S3, since with ETL jobs I can set the DPU count to hopefully speed things up.

But how do I do it? Here is what I tried:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://pinfare-glue/testing-csv"]},
    format = "csv")

outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF,
    connection_type = "s3",
    connection_options = {"path": "s3://pinfare-glue/testing-output"},
    format = "parquet")

But it appears nothing is written. My folder looks like:

[screenshot of s3://pinfare-glue/testing-csv, which contains dated subfolders such as 2018-09-26/]

What's incorrect? My output S3 path only has a file like: testing_output_$folder$

asked Nov 01 '18 by Jiew Meng


1 Answer

I believe the issue is that you have subfolders inside the testing-csv folder, and since you did not set recurse to true, Glue cannot find the files in the 2018-09-26 subfolder (or in any other subfolder).

You need to add the recurse option as follows:

inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://pinfare-glue/testing-csv"], "recurse": True},
    format = "csv")
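
For completeness, here is a minimal sketch of the full corrected job, using the same paths as in your question (note that a Glue job should end with job.commit() so the run is marked as finished):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read every CSV file under testing-csv, including the dated subfolders
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://pinfare-glue/testing-csv"], "recurse": True},
    format = "csv")

# Write the data back out as Parquet
glueContext.write_dynamic_frame.from_options(
    frame = inputGDF,
    connection_type = "s3",
    connection_options = {"path": "s3://pinfare-glue/testing-output"},
    format = "parquet")

job.commit()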

Also, regarding your question about crawlers in the comments: crawlers help infer the schema of your data files and register it as a table in the Glue Data Catalog. In your case a crawler does nothing, since you are creating the DynamicFrame directly from S3.
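
If you had run a crawler and it had registered a table in the Data Catalog, you would read through the catalog instead. A minimal sketch, assuming a hypothetical database pinfare and table testing_csv created by such a crawler:

# Read from the catalog table the crawler created
# ("pinfare" and "testing_csv" are placeholder names)
inputGDF = glueContext.create_dynamic_frame.from_catalog(
    database = "pinfare",
    table_name = "testing_csv")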

answered Oct 23 '22 by Saiful Rizal MDRamli