I have the following problem.
The code below is auto-generated by AWS Glue.
Its job is to read data from Athena (backed by .csv files on S3) and transform the data into Parquet.
The code works for the reference flight dataset and for some relatively big tables (~100 GB).
However, in most cases it returns an error that does not tell me much.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
conf = (SparkConf()
.set("spark.driver.maxResultSize", "8g"))
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
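# Read the source table from the Glue Data Catalog, rename the columns,
# resolve ambiguous column types, drop null fields and write the result
# to S3 as Parquet.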
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "XXX", table_name = "csv_impressions", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("event time", "long", "event_time", "long"), ("user id", "string", "user_id", "string"), ("advertiser id", "long", "advertiser_id", "long"), ("campaign id", "long", "campaign_id", "long")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://xxxx"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
The error message reported by AWS Glue is:
An error occurred while calling o72.pyWriteDynamicFrame
The log file also contains:
Job aborted due to stage failure: ... Task failed while writing rows
Any idea how to find out the reason for the failure, or what it could be?
Part 1: identifying the problem
The way to find out what was causing the problem was to switch the output from .parquet to .csv and drop ResolveChoice and DropNullFields (which Glue automatically suggests for .parquet output):
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://xxxx"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
This produced a more detailed error message:
An error occurred while calling o120.pyWriteDynamicFrame. Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 182, ip-172-31-78-99.ec2.internal, executor 15): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx1.csv.gz
The file xxxx1.csv.gz mentioned in the error message at first appeared to be too big for Glue (~100 MB gzipped and ~350 MB as uncompressed .csv).
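For reference, here is a minimal sketch of how the sizes of the input files can be checked with boto3 (the bucket name and prefix are placeholders):
import boto3

# Placeholders: replace the bucket and prefix with the real location of
# the source .csv.gz files.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="impressions/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".csv.gz"):
            print(obj["Key"], round(obj["Size"] / 1024 / 1024, 1), "MB")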
Part 2: true source of the problem and fix
As mentioned in Part 1, the export to .csv made it possible to identify the problematic file.
Further investigation by loading that .csv into R revealed that one of the columns contained a single string record, while all other values in that column were long or NULL.
After dropping this value in R and re-uploading the data to S3, the problem vanished.
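A rough pandas equivalent of that R check might look like the following; the column name advertiser_id is only a placeholder for whichever column holds the stray string:
import pandas as pd

df = pd.read_csv("xxxx1.csv.gz")

# Values that cannot be parsed as numbers become NaN with errors="coerce";
# rows that are non-null in the raw column but NaN after coercion are the
# unexpected string records.
as_number = pd.to_numeric(df["advertiser_id"], errors="coerce")
bad_rows = df[df["advertiser_id"].notna() & as_number.isna()]
print(bad_rows)

# Drop those rows and write a cleaned file to re-upload to S3.
df.drop(bad_rows.index).to_csv("xxxx1_clean.csv.gz", index=False)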
Note #1: the column was declared string in Athena, so I consider this behaviour a bug.
Note #2: the nature of the problem was not the size of the data. I have successfully processed files of up to 200 MB .csv.gz, which corresponds to roughly 600 MB of uncompressed .csv.
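An untested alternative to cleaning the file by hand might be to force the ambiguous column to a single type inside the Glue job, using ResolveChoice with an explicit cast spec instead of make_struct (the column name is again a placeholder):
resolvechoice2 = ResolveChoice.apply(
    frame=applymapping1,
    specs=[("advertiser_id", "cast:long")],  # placeholder column name
    transformation_ctx="resolvechoice2",
)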