We are using AWS Glue to convert JSON files stored in our S3 data lake.
Here are the steps I followed:
Created a crawler to generate a table in Glue from our data lake bucket, which contains the JSON data.
The newly created table has the following partitions:
Name, Year, Month, day, hour
Created a Glue job to convert the data to Parquet and store it in a different bucket.
With this process, the job runs successfully, but the data in the new bucket is not partitioned; it all ends up in a single directory.
What I want to achieve is for the converted Parquet files to have the same partitions as the source table/data lake bucket.
I also want to increase the size of the Parquet files (i.e., reduce the number of files).
Can anyone help me with this?
When you create the AWS Glue job, you can use either the IAM role that is attached or an existing role. The Python code can use the Pandas and PyArrow libraries to convert the data to Parquet; the Pandas library is already available in AWS Glue.
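As a minimal sketch of that Pandas/PyArrow approach (the bucket names, prefixes, and file names below are hypothetical placeholders, and reading/writing s3:// paths from Pandas also requires the s3fs package):

import pandas as pd

# Hypothetical locations -- replace with your own buckets and prefixes.
input_path = "s3://<source_bucket>/<prefix>/data.json"
output_path = "s3://<output_bucket>/<prefix>/data.parquet"

# Load newline-delimited JSON into an in-memory DataFrame.
df = pd.read_json(input_path, lines=True)

# Write the DataFrame back out as Parquet using the PyArrow engine.
df.to_parquet(output_path, engine="pyarrow", index=False)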
To convert JSON data files to Parquet, you need some in-memory representation of the data first; Parquet does not define its own set of in-memory objects but reuses object models from other frameworks. Once the parser validates the schema, the payload is parsed and written out in Parquet's binary, columnar format.
Yes, CSV/JSON files can be converted to Parquet using AWS Glue.
Short description: an AWS Glue crawler creates multiple tables when your source data files don't all use the same format (such as CSV, Parquet, or JSON) or the same compression type (such as Snappy, gzip, or bzip2).
Try the following for writing the dynamic frame:
# <output_dynamic_frame> is the DynamicFrame produced by your job's transformations.
glueContext.write_dynamic_frame.from_options(
    frame=<output_dynamic_frame>,
    connection_type="s3",
    connection_options={
        "path": "s3://<output_bucket_path>",
        "partitionKeys": ["Name", "Year", "Month", "day", "hour"],
    },
    format="parquet",
)