
Hive: adding rows to existing table

Tags: hadoop, hive

I plan to use Hadoop and Hive to solve the following problem:

I have a stream of data of the form (timestamp, temperature), where each record represents the temperature measured at the given timestamp. I need to compute some aggregates (e.g. max) on a daily basis. The aggregates need to be computed once each day (e.g. at midnight).

I thought of loading the data into Hive somehow and partitioning it by date. However, there is one problem: the stream is not necessarily ordered by timestamp, and I receive delayed records. A record may arrive even a couple of days later than it should. In that case, while generating the usual aggregates, I also need to recompute the aggregates for the day containing that record's timestamp.

Intuitively, I'd like to add the late record to the respective partition in the Hive table. Is it possible to do this without reloading the whole partition? And is reloading a partition a costly operation?

asked Nov 04 '22 13:11 by jfu


1 Answer

I don't believe it is currently possible to add a record to a partition (or to a table, for that matter), so you'll have to sort the records by day before loading each partition into the table. It looks like a two-phase process to me.
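To illustrate the first phase, here is a minimal Python sketch of sorting an incoming batch into per-day buckets before anything is loaded. It assumes timestamps are Unix epoch seconds and that days are cut in UTC; the function name `group_by_day` and the sample values are made up for illustration and are not part of Hive.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_by_day(records):
    """Bucket (timestamp, temperature) pairs by the UTC calendar day of the timestamp.

    Assumption: timestamps are Unix epoch seconds.
    """
    buckets = defaultdict(list)
    for ts, temp in records:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        buckets[day].append((ts, temp))
    return dict(buckets)


# Hypothetical batch: the second record arrives late and belongs to the previous day.
batch = [
    (1667520000, 11.5),  # 2022-11-04 00:00:00 UTC
    (1667433600, 9.0),   # 2022-11-03 00:00:00 UTC (late record)
    (1667563200, 14.2),  # 2022-11-04 12:00:00 UTC
]
grouped = group_by_day(batch)
```

Each key of `grouped` then corresponds to one date partition that would need to be loaded or rebuilt in the second phase.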

I believe that you can, however, overwrite a partition, so at least you would only have to handle the modified partition.
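As a rough sketch of that idea, the Python below models the table as an in-memory dict keyed by day, rewrites only the partitions that received late records, and recomputes the daily max for just those days. The dict is only a stand-in for a Hive table; the names `overwrite_partitions` and `daily_max` and the sample rows are hypothetical. In Hive itself the rewrite step would presumably be an `INSERT OVERWRITE TABLE ... PARTITION (...)` statement per affected day.

```python
def overwrite_partitions(table, late_records_by_day):
    """Replace each affected partition wholesale with its old rows plus the late rows.

    `table` maps a day string to a list of (timestamp, temperature) rows.
    Returns the set of days whose partition was rewritten.
    """
    touched = set()
    for day, late_rows in late_records_by_day.items():
        # Build the full new content of the partition, then swap it in.
        table[day] = table.get(day, []) + list(late_rows)
        touched.add(day)
    return touched


def daily_max(table, days):
    """Recompute the max temperature for the given days only."""
    return {day: max(temp for _, temp in table[day]) for day in days}


# Hypothetical state: two daily partitions already loaded.
table = {
    "2022-11-03": [(1667433600, 9.0)],
    "2022-11-04": [(1667520000, 11.5)],
}
# A late record for 2022-11-03 arrives after that day's aggregate was computed.
late = {"2022-11-03": [(1667476800, 12.5)]}

touched = overwrite_partitions(table, late)
refreshed = daily_max(table, touched)
```

Only the partition for 2022-11-03 is rewritten and re-aggregated here; the partition for 2022-11-04 is left untouched.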

At the moment, at least, Hive is a batch-oriented system.

answered Nov 09 '22 09:11 by Yossi Dahan