What is a simple way to write a POJO to HDFS in Parquet format (using the Java API) by directly creating a Parquet schema for it, without using Avro or MR?
The samples I found were outdated, used deprecated methods, and all relied on Avro, Spark, or MR.
Avro is a row-based storage format, whereas Parquet is a columnar storage format. Parquet is much better suited to analytical querying: reads are far more efficient than writes, while write operations are faster in Avro. On disk, Parquet actually stores data in a hybrid manner: it partitions the data horizontally into row groups and stores each row group column by column.
What is Parquet? Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Indeed, there are not many samples available for reading/writing Apache Parquet files without the help of an external framework.
The core Parquet library is parquet-column, where you can find some test files that read/write directly: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/test/java/org/apache/parquet/io/TestColumnIO.java
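As an illustration, here is a minimal sketch of declaring a Parquet MessageType by hand using the Types builder from parquet-column. The Pojo name and its id/name/score fields are hypothetical, chosen only for the example:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class PojoSchema {
    // Build the Parquet schema directly, mirroring the fields of the POJO.
    public static MessageType schema() {
        return Types.buildMessage()
                .required(PrimitiveTypeName.INT64).named("id")
                .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("name")
                .optional(PrimitiveTypeName.DOUBLE).named("score")
                .named("Pojo"); // name of the message (record) type
    }
}

The same schema can also be written as a string and parsed with MessageTypeParser.parseMessageType, as in the write sketch further below.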
You then just need to use the same functionality with an HDFS file, as sketched below. You can follow this SO question for the HDFS side: Accessing files in HDFS using Java
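Putting the two together, here is a minimal write sketch, assuming a namenode reachable at hdfs://namenode:8020 (a placeholder; point it at your own cluster) and using the example Group API from parquet-hadoop instead of Avro. Note that in recent parquet-mr versions the builder(Path) overload is deprecated in favor of HadoopOutputFile, but it still works:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class PojoParquetWrite {
    public static void main(String[] args) throws Exception {
        // Declare the schema directly as a string, without Avro.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message Pojo { required int64 id; required binary name (UTF8); }");
        Configuration conf = new Configuration();
        // hdfs://namenode:8020 is a placeholder; point it at your cluster.
        Path path = new Path("hdfs://namenode:8020/tmp/pojo.parquet");
        try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
                .withConf(conf)
                .withType(schema)
                .build()) {
            SimpleGroupFactory factory = new SimpleGroupFactory(schema);
            // One Group per POJO instance.
            Group row = factory.newGroup()
                    .append("id", 1L)
                    .append("name", "alice");
            writer.write(row);
        }
    }
}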
UPDATE: to address the deprecated parts of the API: AvroWriteSupport should be replaced by AvroParquetWriter. I checked ParquetWriter; it is not deprecated and can be used safely.
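For completeness, reading the file back without Avro follows the same pattern with GroupReadSupport; again a sketch, reusing the placeholder path from above:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class PojoParquetRead {
    public static void main(String[] args) throws Exception {
        Path path = new Path("hdfs://namenode:8020/tmp/pojo.parquet");
        try (ParquetReader<Group> reader =
                ParquetReader.builder(new GroupReadSupport(), path).build()) {
            Group row;
            // read() returns null once the file is exhausted.
            while ((row = reader.read()) != null) {
                System.out.println(row.getLong("id", 0) + " " + row.getString("name", 0));
            }
        }
    }
}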
Regards,
Loïc