
Hive view with nested selects and partition pruning

Tags: hadoop, hive

I have a view in Hive with a subselect. The purpose of the view is to remove duplicates from the source table.

The source table is partitioned by the source_system column.

CREATE VIEW myview AS
SELECT * FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY source_system, key ORDER BY modification_date DESC) AS seq_rn
    FROM mytable
) t
WHERE seq_rn = 1;

The problem is that if I run

EXPLAIN DEPENDENCY SELECT * FROM myview WHERE source_system = 'AAA';

I see that all partitions are being scanned, so partition pruning is not happening.

Is there any way around this?

mishkin asked Oct 27 '16 18:10



1 Answer

Workaround

As mentioned in the latest comment, it is possible to build a separate view for every filter value.
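A minimal sketch of what one view per filter could look like, assuming the filter in question is a single source_system value. The view name myview_aaa is hypothetical, and the idea of placing the filter inside the inner query (before the window function) is an assumption about how the workaround is meant to be applied, not something confirmed in the comments:

```sql
-- Hypothetical per-filter view for source_system = 'AAA'.
-- The filter on the partition column is applied directly to mytable,
-- before row_number() is computed, so it does not have to be pushed
-- through the window function by the optimizer.
CREATE VIEW myview_aaa AS
SELECT * FROM (
    SELECT
        *,
        row_number() OVER (PARTITION BY source_system, key ORDER BY modification_date DESC) AS seq_rn
    FROM mytable
    WHERE source_system = 'AAA'
) t
WHERE seq_rn = 1;

-- Check which partitions the view actually reads.
EXPLAIN DEPENDENCY SELECT * FROM myview_aaa;
```

Because the window is partitioned by source_system, filtering before or after the deduplication should give the same rows for a given source system. The cost is one view per value that needs to be queried.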


Note: the following does not help.

As mentioned in the comments, it should be possible to solve this using partitioned views, as documented here: https://cwiki.apache.org/confluence/display/Hive/PartitionedViews#PartitionedViews-Syntax

In case the partitioning does not extend to subqueries, try this:

  1. Make a view with the inner query
  2. Make a second view on top of that, with the outer query

I would normally not advocate building views on views, but if that is what it takes to make partition pruning work, it would justify the design choice.
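For reference, the two-step layout could be sketched as follows. The view names myview_ranked and myview_dedup are hypothetical, and, as noted in this answer, this approach was reported not to help with pruning, so treat it as an illustration of the idea rather than a fix:

```sql
-- Step 1 (hypothetical name): view containing only the inner query.
CREATE VIEW myview_ranked AS
SELECT
    *,
    row_number() OVER (PARTITION BY source_system, key ORDER BY modification_date DESC) AS seq_rn
FROM mytable;

-- Step 2 (hypothetical name): view on top of the first, with the outer filter.
CREATE VIEW myview_dedup AS
SELECT * FROM myview_ranked
WHERE seq_rn = 1;

-- Inspect which partitions are read when filtering on the partition column.
EXPLAIN DEPENDENCY SELECT * FROM myview_dedup WHERE source_system = 'AAA';
```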

Dennis Jaheruddin answered Nov 01 '22 18:11