Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find the most recent partition in HIVE table

I have a partitioned table - with 201 partitions. I need to find latest partition in this table and use it to post process my data. The query to find list of all partitions is :

use db;
show partitions table_name; 

I need a query to find the latest of these partitions. The partitions are in format

ingest_date=2016-03-09

I tried using max() which gave me a wrong result. I do not want to traverse through entire table by doing

select max(ingest_date) from db.table_name; 

This would give me the expected output.. but kill the whole point of having partitions in the 1st place.

Is there a more efficient query to get the latest partition for HIve table ?

like image 214
underwood Avatar asked Mar 18 '16 23:03

underwood


People also ask

How do I see the latest partition in Hive?

The only efficient way to list the partitions of a Hive table is to use the Metastore java API. Or to reverse-engineer the way partitions are organized in HDFS (quite straightforward actually) then list the directories, hoping HDFS and the Metastore are 100% in sync.

How do you check partitions on a Hive table?

You can see Hive MetaStore tables,Partitions information in table of "PARTITIONS". You could use "TBLS" join "Partition" to query special table partitions.

Which command is used to see the partition keys present in a Hive table?

You can run the HDFS list command to show all partition folders of a table from the Hive data warehouse location.

What is the use of MSCK repair table in Hive?

Use the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive compatible partitions. The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created.


Video Answer


3 Answers

You can use "show partitions":

hive -e "set hive.cli.print.header=false;show partitions table_name;" | tail -1 | cut -d'=' -f2

This will give you "2016-03-09" as output.

like image 59
pkgajulapalli Avatar answered Oct 18 '22 19:10

pkgajulapalli


If you want to avoid running the "show partitions" in hive shell as suggested above, you can apply a filter to your max() query. That will avoid doing a fulltable scan and results should be fairly quick!

select max(ingest_date) from db.table_name where ingest_date>date_add(current_date,-3) will only scan 2-3 partitions.

like image 10
Kash Avatar answered Oct 18 '22 19:10

Kash


if you know your table location in hdfs. This is the most quick way without even opening the hive shell.

You can check you table location in hdfs using command;

show create table <table_name>

then

hdfs dfs -ls <table_path>| sort -k6,7 | tail -1

It will show latest partition location in hdfs

like image 2
Strick Avatar answered Oct 18 '22 20:10

Strick