
Hive SQL: find the latest record

The table is:

    create table test (id string, name string, age string, modified string)

The data looks like this:

    id    name   age  modified
    1     a      10   2011-11-11 11:11:11
    1     a      11   2012-11-11 12:00:00
    2     b      20   2012-12-10 10:11:12
    2     b      20   2012-12-10 10:11:12
    2     b      20   2012-12-12 10:11:12
    2     b      20   2012-12-15 10:11:12

I want to get the latest record for each id, including every column (id, name, age, modified). For the data above, the correct result is:

    1     a      11   2012-11-11 12:00:00
    2     b      20   2012-12-15 10:11:12

This is what I do now:

    insert overwrite table t
    select b.id, b.name, b.age, b.modified
    from (
        select id, max(modified) as modified
        from test
        group by id
    ) a
    left outer join test b
      on (a.id = b.id and a.modified = b.modified);

This SQL returns the right result, but it runs slowly on large amounts of data.

**Is there any way to do this without the left outer join?**

qiulp asked Nov 23 '12 04:11




1 Answer

There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example, if you have a table like:

    test_argmax
    id,val,key
    1,1,A
    1,2,B
    1,3,C
    1,2,D
    2,1,E
    2,1,U
    2,2,V
    2,3,W
    2,2,X
    2,1,Y

You can do this:

    select
      max(struct(val, key, id)).col1 as max_val,
      max(struct(val, key, id)).col2 as max_key,
      max(struct(val, key, id)).col3 as max_id
    from test_argmax
    group by id

and get the result:

    max_val,max_key,max_id
    3,C,1
    3,W,2

I think that in case of ties on val (the first struct element) the comparison falls back to the second element, and so on. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?
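To tie this back to the question's `test` table, here is a rough sketch of how the same trick might look with `named_struct`, which lets the fields be read back by name instead of `col1`, `col2`, and so on. It assumes that `max()` accepts a `named_struct` the same way it accepts a plain `struct`, and that `modified` is always stored as a `yyyy-MM-dd HH:mm:ss` string, so that string order matches chronological order. The aliases `latest` and `t` are made up for this example, and I have not run it against the asker's Hive version.

```sql
-- Sketch only: one pass over `test`, no join.
-- `modified` is the first struct field, so it drives the comparison;
-- `name` and `age` are only used to break ties on `modified`.
select
  t.id,
  t.latest.name     as name,
  t.latest.age      as age,
  t.latest.modified as modified
from (
  select
    id,
    -- build the struct once per row and keep the largest one per id
    max(named_struct('modified', modified, 'name', name, 'age', age)) as latest
  from test
  group by id
) t;
```

One difference from the join version: if two rows share the same `id` and the same latest `modified` value, this should return a single row for that `id`, whereas the join returns every matching row.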

patricksurry answered Oct 11 '22 13:10
