I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy.
I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents claim.
Here are some details of my data.
Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB
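Roughly, the tables were created along these lines (table and column names below are placeholders, and the DDL is a sketch rather than my exact statements):

```sql
-- Source table in plain text format (Table A)
CREATE TABLE logs_text (id BIGINT, amount DOUBLE, category STRING, event_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Table B: ORC with the default codec (ZLIB)
CREATE TABLE logs_orc STORED AS ORC
AS SELECT * FROM logs_text;

-- Table C: ORC with Snappy compression
CREATE TABLE logs_orc_snappy STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM logs_text;

-- Table D: Parquet
CREATE TABLE logs_parquet STORED AS PARQUET
AS SELECT * FROM logs_text;
```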
Parquet gave the worst compression for my table.
My tests with the above tables yielded the following results.
Row count operation
Text format - cumulative CPU 123.33 sec
Parquet format - cumulative CPU 204.92 sec
ORC format - cumulative CPU 119.99 sec
ORC with Snappy - cumulative CPU 107.05 sec
Sum of a column operation
Text format - cumulative CPU 127.85 sec
Parquet format - cumulative CPU 255.2 sec
ORC format - cumulative CPU 120.48 sec
ORC with Snappy - cumulative CPU 98.27 sec
Average of a column operation
Text format - cumulative CPU 128.79 sec
Parquet format - cumulative CPU 211.73 sec
ORC format - cumulative CPU 165.5 sec
ORC with Snappy - cumulative CPU 135.45 sec
Selecting 4 columns from a given range using a WHERE clause
Text format - cumulative CPU 72.48 sec
Parquet format - cumulative CPU 136.4 sec
ORC format - cumulative CPU 96.63 sec
ORC with Snappy - cumulative CPU 82.05 sec
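For reference, the four operations above were of this general shape (column names here are placeholders, not my actual schema):

```sql
-- Row count
SELECT COUNT(*) FROM logs_orc;

-- Sum of a column
SELECT SUM(amount) FROM logs_orc;

-- Average of a column
SELECT AVG(amount) FROM logs_orc;

-- Selecting 4 columns from a given range using a WHERE clause
SELECT id, amount, category, event_date
FROM logs_orc
WHERE event_date BETWEEN '2015-01-01' AND '2015-03-31';
```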
Does that mean ORC is faster than Parquet? Or is there something I can do to make it work better in terms of query response time and compression ratio?
Thanks!
Parquet is more capable of storing nested data. ORC is more capable of predicate pushdown. ORC supports ACID properties. ORC is more compression-efficient.
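To make the ACID point concrete: transactional tables in Hive must be stored as ORC. A minimal sketch (the table name is made up, and on Hive 1.x/2.x the table also has to be bucketed and the transaction manager enabled):

```sql
-- Hive ACID requires ORC storage; bucketing is required on Hive 1.x/2.x.
CREATE TABLE orders_acid (id BIGINT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- With the transaction manager enabled, row-level DML becomes possible:
UPDATE orders_acid SET status = 'shipped' WHERE id = 42;
DELETE FROM orders_acid WHERE status = 'cancelled';
```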
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
ORC files are binary files in a specialized format. When you specify orc.compress = SNAPPY, the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.
By default Big SQL will use SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT… SELECT commands, then SNAPPY compression is enabled by default.
I would say that both of these formats have their own advantages.
Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does (see here).
Apache ORC might be better if your file-structure is flattened.
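For example, a table with deeply nested columns is where Parquet's Dremel-style layout tends to pay off (the schema below is made up for illustration):

```sql
-- Hypothetical table with nested types stored as Parquet.
CREATE TABLE events_nested (
  user_id BIGINT,
  device  STRUCT<os:STRING, version:STRING>,
  clicks  ARRAY<STRUCT<url:STRING, ts:BIGINT>>
)
STORED AS PARQUET;
```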
And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially when it comes to sum operations.
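The index and Bloom filter are requested through ORC table properties; a sketch (the table name and the column chosen for the Bloom filter are just examples):

```sql
-- ORC writes a lightweight row-group index automatically;
-- Bloom filters are opt-in per column via table properties.
CREATE TABLE logs_orc_indexed (id BIGINT, amount DOUBLE, category STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.create.index" = "true",
  "orc.bloom.filter.columns" = "category",
  "orc.bloom.filter.fpp" = "0.05"
);
```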
The Parquet default compression is SNAPPY. Are Tables A, B, C, and D holding the same dataset? If so, it looks like something is off when it only compresses to 1.9 GB.
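It is also worth checking how the Parquet table was written: from Hive the codec can be set explicitly, either per session or per table (a sketch; verify that the property is honored by your Hive version):

```sql
-- Force Snappy for Parquet output written by Hive (session level) ...
SET parquet.compression=SNAPPY;

-- ... or pin it on the table itself.
CREATE TABLE logs_parquet_snappy (id BIGINT, amount DOUBLE, category STRING, event_date STRING)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");
```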
You are seeing this because:
Hive has a vectorized ORC reader but no vectorized Parquet reader.
Spark has a vectorized Parquet reader and no vectorized ORC reader.
Spark performs best with Parquet; Hive performs best with ORC.
I've seen similar differences when running ORC and Parquet with Spark.
Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.
(correct as of Hive 2.0 and Spark 2.1)
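In Hive you can confirm whether the ORC path is actually vectorized (table and column names below are placeholders):

```sql
-- Turn on vectorized execution for the ORC reader.
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

-- Vectorized stages show up as "Execution mode: vectorized" in the plan.
EXPLAIN SELECT SUM(amount) FROM logs_orc;
```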