I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy.
I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents claim.
Here are some details of my data.
Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB
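Roughly, the tables were created along these lines (table and column names below are placeholders, and the DDL is a sketch rather than my exact statements):

```sql
-- Source table in plain text format (Table A)
CREATE TABLE logs_text (id BIGINT, amount DOUBLE, category STRING, event_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Table B: ORC with the default codec (ZLIB)
CREATE TABLE logs_orc STORED AS ORC
AS SELECT * FROM logs_text;

-- Table C: ORC with Snappy compression
CREATE TABLE logs_orc_snappy STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM logs_text;

-- Table D: Parquet
CREATE TABLE logs_parquet STORED AS PARQUET
AS SELECT * FROM logs_text;
```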
Parquet gave the worst compression for my table.
My tests with the above tables yielded the following results.
Row count operation
Text format - cumulative CPU 123.33 sec
Parquet format - cumulative CPU 204.92 sec
ORC format - cumulative CPU 119.99 sec
ORC with Snappy - cumulative CPU 107.05 sec
Sum of a column operation
Text format - cumulative CPU 127.85 sec
Parquet format - cumulative CPU 255.2 sec
ORC format - cumulative CPU 120.48 sec
ORC with Snappy - cumulative CPU 98.27 sec
Average of a column operation
Text format - cumulative CPU 128.79 sec
Parquet format - cumulative CPU 211.73 sec
ORC format - cumulative CPU 165.5 sec
ORC with Snappy - cumulative CPU 135.45 sec
Selecting 4 columns from a given range using a WHERE clause
Text format - cumulative CPU 72.48 sec
Parquet format - cumulative CPU 136.4 sec
ORC format - cumulative CPU 96.63 sec
ORC with Snappy - cumulative CPU 82.05 sec
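For reference, the four operations above were of this general shape (column names here are placeholders, not my actual schema):

```sql
-- Row count
SELECT COUNT(*) FROM logs_orc;

-- Sum of a column
SELECT SUM(amount) FROM logs_orc;

-- Average of a column
SELECT AVG(amount) FROM logs_orc;

-- Selecting 4 columns from a given range using a WHERE clause
SELECT id, amount, category, event_date
FROM logs_orc
WHERE event_date BETWEEN '2015-01-01' AND '2015-03-31';
```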
Does that mean ORC is faster than Parquet? Or is there something I can do to make it work better in terms of query response time and compression ratio?
Thanks!
Parquet is more capable of storing nested data. ORC is more capable of predicate pushdown. ORC supports ACID properties. ORC is more compression-efficient.
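To make the ACID point concrete: transactional tables in Hive must be stored as ORC. A minimal sketch (the table name is made up, and on Hive 1.x/2.x the table also has to be bucketed and the transaction manager enabled):

```sql
-- Hive ACID requires ORC storage; bucketing is required on Hive 1.x/2.x.
CREATE TABLE orders_acid (id BIGINT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- With the transaction manager enabled, row-level DML becomes possible:
UPDATE orders_acid SET status = 'shipped' WHERE id = 42;
DELETE FROM orders_acid WHERE status = 'cancelled';
```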
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
ORC files are binary files in a specialized format. When you specify orc.compress = SNAPPY, the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.
By default Big SQL will use SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT… SELECT commands, then SNAPPY compression is enabled by default.
I would say that both of these formats have their own advantages.
Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does (see here).
Apache ORC might be better if your file-structure is flattened.
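For example, a table with deeply nested columns is where Parquet's Dremel-style layout tends to pay off (the schema below is made up for illustration):

```sql
-- Hypothetical table with nested types stored as Parquet.
CREATE TABLE events_nested (
  user_id BIGINT,
  device  STRUCT<os:STRING, version:STRING>,
  clicks  ARRAY<STRUCT<url:STRING, ts:BIGINT>>
)
STORED AS PARQUET;
```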
And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially when it comes to sum operations.
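The index and Bloom filter are requested through ORC table properties; a sketch (the table name and the column chosen for the Bloom filter are just examples):

```sql
-- ORC writes a lightweight row-group index automatically;
-- Bloom filters are opt-in per column via table properties.
CREATE TABLE logs_orc_indexed (id BIGINT, amount DOUBLE, category STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.create.index" = "true",
  "orc.bloom.filter.columns" = "category",
  "orc.bloom.filter.fpp" = "0.05"
);
```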
The Parquet default compression is SNAPPY. Are Tables A, B, C, and D holding the same dataset? If so, it looks like something is off when it only compresses to 1.9 GB.
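It is also worth checking how the Parquet table was written: from Hive the codec can be set explicitly, either per session or per table (a sketch; verify that the property is honored by your Hive version):

```sql
-- Force Snappy for Parquet output written by Hive (session level) ...
SET parquet.compression=SNAPPY;

-- ... or pin it on the table itself.
CREATE TABLE logs_parquet_snappy (id BIGINT, amount DOUBLE, category STRING, event_date STRING)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");
```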
You are seeing this because:
Hive has a vectorized ORC reader but no vectorized Parquet reader.
Spark has a vectorized Parquet reader and no vectorized ORC reader.
Spark performs best with Parquet; Hive performs best with ORC.
I've seen similar differences when running ORC and Parquet with Spark.
Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.
(correct as of Hive 2.0 and Spark 2.1)
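In Hive you can confirm whether the ORC path is actually vectorized (table and column names below are placeholders):

```sql
-- Turn on vectorized execution for the ORC reader.
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

-- Vectorized stages show up as "Execution mode: vectorized" in the plan.
EXPLAIN SELECT SUM(amount) FROM logs_orc;
```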