the table is:
create table test ( id string, name string, age string, modified string)
data like this:
id name age modifed 1 a 10 2011-11-11 11:11:11 1 a 11 2012-11-11 12:00:00 2 b 20 2012-12-10 10:11:12 2 b 20 2012-12-10 10:11:12 2 b 20 2012-12-12 10:11:12 2 b 20 2012-12-15 10:11:12
I want to get the latest record(include every colums id,name,age,modifed) group by id,as the data above,the correct result is:
1 a 11 2012-11-11 12:00:00 2 b 20 2012-12-15 10:11:12
I do like this:
insert overwrite table t select b.id, b.name, b.age, b.modified from ( select id,max(modified) as modified from test group by id ) a left outer join test b on (a.id=b.id and a.modified=b.modified);
This sql can get the right result,but when mass data,it runs slow.
**Is there any way to do this without left outer join? **
You can use to_date function in your where clause to get only the max(Last uploaded date) records.
Here is the syntax that we can use to get the latest date records in SQL Server. Select column_name, .. From table_name Order By date_column Desc; Now, let's use the given syntax to select the last 10 records from our sample table.
Get the transient_lastDdlTime from your Hive table. SHOW CREATE TABLE table_name; Then copy paste the transient_lastDdlTime in below query to get the value as timestamp. This gives the creation time rather than the last modified time.
"transient_lastDdlTime" is the property which tells about the last altered time of Hive tables.
There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example if you have a table like:
test_argmax id,val,key 1,1,A 1,2,B 1,3,C 1,2,D 2,1,E 2,1,U 2,2,V 2,3,W 2,2,X 2,1,Y
You can do this:
select max(struct(val, key, id)).col1 as max_val, max(struct(val, key, id)).col2 as max_key, max(struct(val, key, id)).col3 as max_id from test_argmax group by id
and get the result:
max_val,max_key,max_id 3,C,1 3,W,2
I think in case of ties on val (the first struct element) it will fall back to comparison on the second column. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With