
Parquet without Hadoop?

I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the Hadoop/HDFS libraries. Is it possible to use Parquet outside of HDFS? And what is the minimal dependency?

asked Mar 26 '15 by capacman

People also ask

Does Parquet require Hadoop?

You don't need HDFS/Hadoop to consume a Parquet file. There are different ways to consume Parquet.

What is difference between Avro & Parquet?

AVRO is a row-based storage format, whereas PARQUET is a columnar storage format. PARQUET is much better suited to analytical querying, i.e., reads and queries are much more efficient than writes. Write operations are faster in AVRO than in PARQUET.

Is Apache Parquet free?

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop.

Why is Parquet better for spark?

Parquet has higher execution speed than other standard file formats like Avro and JSON, and it also consumes less disk space than Avro and JSON.


1 Answer

Investigating the same question, I found that it's apparently not possible at the moment. I found this Git issue, which proposes decoupling Parquet from the Hadoop API. Apparently it has not been done yet.

In the Apache Jira I found an issue which asks for a way to read a Parquet file outside Hadoop. It was unresolved at the time of writing.

EDIT:

Issues are no longer tracked on GitHub (the first link above is dead). A newer issue I found is located on Apache's Jira, with the following headline:

make it easy to read and write parquet files in java without depending on hadoop

answered Sep 19 '22 by Fabian Braun