
Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the main options. I tested ORC once with its default compression and once with Snappy.
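For reference, roughly the kind of Hive DDL that produces such tables (table and column names below are illustrative, not my actual schema):

    CREATE TABLE sales_text (id INT, amount DOUBLE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE;

    -- ORC with default compression (ZLIB)
    CREATE TABLE sales_orc STORED AS ORC
        AS SELECT * FROM sales_text;

    -- ORC with Snappy compression
    CREATE TABLE sales_orc_snappy STORED AS ORC
        TBLPROPERTIES ("orc.compress" = "SNAPPY")
        AS SELECT * FROM sales_text;

    -- Parquet
    CREATE TABLE sales_parquet STORED AS PARQUET
        AS SELECT * FROM sales_text;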

I have read many documents that claim Parquet is better than ORC in both time and space, but my tests show the opposite of what those documents state.

Here are some details of my data.

Table A - Text File Format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB

Parquet gave the worst compression for my table.

My tests with the above tables yielded the following results (the general shape of the queries is sketched after the numbers).

Row count operation

Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with Snappy Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with Snappy Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with Snappy Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with Snappy Cumulative CPU - 82.05 sec
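For clarity, the queries were of this general shape (column names are placeholders, not my actual schema):

    SELECT COUNT(*) FROM sales_orc;
    SELECT SUM(amount) FROM sales_orc;
    SELECT AVG(amount) FROM sales_orc;
    SELECT col1, col2, col3, col4
        FROM sales_orc
        WHERE amount BETWEEN 100 AND 200;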

Does that mean ORC is faster than Parquet? Or is there something I can do to improve the query response time and compression ratio?

Thanks!

asked Sep 03 '15 by Rahul


People also ask

Which is better, ORC or Parquet?

Parquet is more capable of storing nested data. ORC is more capable of predicate pushdown, supports ACID properties, and is more compression-efficient.

Why ORC is preferred for Hive?

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

What is ORC compress snappy?

ORC files are binary files in a specialized format. When you specify orc.compress = SNAPPY, the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.

What is snappy in Parquet?

By default, Big SQL uses SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT… SELECT commands, SNAPPY compression is enabled by default.


2 Answers

I would say that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree, the way Google Dremel does (see here).
Apache ORC might be better if your file structure is flat.

And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially when it comes to sum operations.
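For example, a Bloom filter on a frequently filtered column can be requested via table properties when the ORC table is created (the column name here is just an example):

    CREATE TABLE sales_orc (
        id          INT,
        customer_id INT,
        amount      DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES (
        "orc.compress"             = "SNAPPY",
        "orc.bloom.filter.columns" = "customer_id",
        "orc.bloom.filter.fpp"     = "0.05"
    );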

The Parquet default compression is SNAPPY. Do Tables A, B, C, and D hold the same dataset? If so, something looks off when Parquet only compresses it down to 1.9 GB.
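If you want to rule compression out as a factor, you can set the Parquet codec explicitly when writing the table, either per session or per table (a sketch, assuming a Hive version that honours the parquet.compression property):

    SET parquet.compression=SNAPPY;   -- session-wide, applies to subsequent writes

    CREATE TABLE sales_parquet_snappy STORED AS PARQUET
        TBLPROPERTIES ("parquet.compression" = "SNAPPY")
        AS SELECT * FROM sales_text;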

answered Oct 01 '22 by PhanThomas


You are seeing this because:

  • Hive has a vectorized ORC reader but no vectorized Parquet reader.

  • Spark has a vectorized Parquet reader but no vectorized ORC reader.

  • Spark performs best with Parquet; Hive performs best with ORC.

I've seen similar differences when running ORC and Parquet with Spark.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.

(correct as of Hive 2.0 and Spark 2.1)
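These are the relevant switches, in case you want to verify this on your own cluster (both exist in the versions mentioned above):

    -- Hive: vectorized query execution, which the ORC reader can take advantage of
    SET hive.vectorized.execution.enabled = true;
    SET hive.vectorized.execution.reduce.enabled = true;

    -- Spark: vectorized Parquet reader (on by default in Spark 2.x)
    -- spark.sql.parquet.enableVectorizedReader=true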

answered Oct 01 '22 by jonathanChap