
Is there a way to create a Parquet file from an XML/JSON input file without an .avsc file and without Impala/Hive?

Tags:

parquet

I want to convert my input file (XML/JSON) to Parquet. I already have one solution that works with Spark and creates the required Parquet file.

However, due to other client requirements, I might need a solution that does not involve the Hadoop ecosystem, such as Hive, Impala, Spark, or MapReduce.

Kite SDK appears to use an .avsc file to create Parquet data; kindly correct me if I am wrong. I might be short-sighted, but it looks like it needs an Avro schema file. So, is there any library that can create a Parquet file from self-describing files such as XML or JSON?

Note: If this is not a proper approach, I would like to understand why it is not recommended, so that I can learn about the areas I might have missed.

Srini asked Oct 24 '25 17:10


1 Answer

I just published one using Python.

https://github.com/blackrock/xml_to_parquet

Convert one or more XML files into Apache Parquet format. Only requires an XSD and an XML file to get started.

It uses the XSD schema file to convert everything in your XML file into an equivalent Parquet file with nested data structures that match the XML paths.
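To illustrate what "nested data structures that match XML paths" can look like, here is a minimal sketch using only the Python standard library. It is not the tool's implementation: the function name `element_to_record`, the `@` prefix for attributes, the `#text` key, and the sample XML are all assumptions made for this example, and it ignores the XSD entirely, so every value stays a string.

```python
import xml.etree.ElementTree as ET


def element_to_record(elem):
    """Convert an XML element into a nested dict keyed by tag names.

    Hypothetical helper for illustration; no XSD typing is applied.
    """
    children = list(elem)
    text = (elem.text or "").strip()
    if not children and not elem.attrib:
        # Leaf element without attributes: keep only its text
        return text

    # Attributes are stored with an "@" prefix (an assumed convention)
    record = {"@" + name: value for name, value in elem.attrib.items()}
    if text and not children:
        record["#text"] = text

    for child in children:
        value = element_to_record(child)
        if child.tag in record:
            # Repeated tag under the same parent: collect values in a list
            if not isinstance(record[child.tag], list):
                record[child.tag] = [record[child.tag]]
            record[child.tag].append(value)
        else:
            record[child.tag] = value
    return record


# Made-up sample document, not taken from the repository
SAMPLE_XML = """
<PurchaseOrder OrderDate="2021-01-21">
  <ShipTo><Name>Alice</Name><City>Boston</City></ShipTo>
  <Item><SKU>A1</SKU><Qty>2</Qty></Item>
  <Item><SKU>B2</SKU><Qty>1</Qty></Item>
</PurchaseOrder>
""".strip()

record = element_to_record(ET.fromstring(SAMPLE_XML))
print(record)
```

Each key path in the resulting dict (for example `ShipTo` then `City`) mirrors the element path in the XML, and repeated elements such as `Item` become a list. A schema-aware converter would additionally use the XSD to decide column types and which elements may repeat.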

Convert a small XML file to a Parquet file
python xml_to_parquet.py -x PurchaseOrder.xsd PurchaseOrder.xml

INFO - 2021-01-21 12:32:38 - Parsing XML Files..
INFO - 2021-01-21 12:32:38 - Processing 1 files
DEBUG - 2021-01-21 12:32:38 - Generating schema from PurchaseOrder.xsd
DEBUG - 2021-01-21 12:32:38 - Parsing PurchaseOrder.xml
DEBUG - 2021-01-21 12:32:38 - Saving to file PurchaseOrder.xml.parquet
DEBUG - 2021-01-21 12:32:38 - Completed PurchaseOrder.xml
David Lee answered Oct 26 '25 16:10
