AWS Glue convert files from JSON to Parquet with same partitions as source table

We are using AWS Glue to convert JSON files stored in our S3 data lake.

Here are the steps I followed:

  1. Created a crawler to generate a table in Glue from our data lake bucket, which contains JSON data.

  2. The newly created table has the following partition keys:

    Name, Year, Month, day, hour

  3. Created a Glue job to convert the data to Parquet and store it in a different bucket.

With this process, the job runs successfully, but the data in the new bucket is not partitioned; it all ends up under a single directory.
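For reference, the job's write step looks roughly like the sketch below (the database, table, and bucket names here are placeholders, not our real ones); since no partition keys are specified, everything gets written to one directory:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the crawled table from the Glue Data Catalog.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",       # placeholder
    table_name="my_json_table")   # placeholder

# Write as Parquet; with no partition keys the output is unpartitioned.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/parquet/"},  # placeholder
    format="parquet")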

What I want to achieve is for the converted Parquet files to have the same partitions as the source table/data lake bucket.

Also, I want to increase the size of the Parquet files (i.e., reduce the number of files).

Can anyone help me on this?

asked Feb 12 '18 by Vishnu Prassad




1 Answer

Try the following when writing the dynamic frame:

glueContext.write_dynamic_frame.from_options(
    frame=<output_dataframe>,
    connection_type="s3",
    connection_options={
        "path": "s3://<output_bucket_path>",
        "partitionKeys": ["Name", "Year", "Month", "day", "hour"]
    },
    format="parquet")
answered Oct 12 '22 by kmn