Hive: adding rows to existing table

Question

I plan to use Hadoop and Hive to solve the following problem:

I have a stream of data, say of the form (timestamp, temperature), where each record represents the temperature measured at the given timestamp. I need to compute some aggregates (e.g. max) on a daily basis. The aggregates need to be computed once a day (e.g. at midnight).

I thought of loading the data into Hive and partitioning it by date. However, there is one problem: the data in the stream is not necessarily ordered by timestamp. I receive delayed records; a record may arrive even a couple of days later than it should. In that case, while generating the usual aggregates, I also need to recompute the aggregates for the day containing the late record's timestamp.

Intuitively, I'd like to add the late record to the respective partition in the Hive table. Is it possible to do this without reloading the whole partition? And is reloading a partition a costly operation?

Yossi Dahan · Accepted Answer

I don't believe it is currently possible to add a record to an existing partition (or to a table, for that matter), so you'll have to sort the records by date before loading each partition into the table. It looks like a two-phase process to me.
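The first phase, sorting incoming records into their days, can be sketched in plain Python. This is only an illustration of the grouping logic, not Hive code: the function name `bucket_by_day`, the use of Unix-epoch timestamps, and the choice of UTC day boundaries are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_by_day(records):
    """Group (timestamp, temperature) pairs by UTC calendar day.

    Assumes timestamps are Unix epoch seconds (an assumption for this sketch).
    """
    buckets = defaultdict(list)
    for ts, temp in records:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        buckets[day].append((ts, temp))
    # Sort the days, and the rows within each day, so every partition
    # can be loaded in timestamp order.
    return {day: sorted(rows) for day, rows in sorted(buckets.items())}


batch = [
    (1700000000, 21.5),  # falls on 2023-11-14 (UTC)
    (1699900000, 19.0),  # falls on 2023-11-13 (UTC), i.e. a late record
    (1700003600, 22.1),  # falls on 2023-11-14 (UTC)
]
partitions = bucket_by_day(batch)
```

Each key of `partitions` then corresponds to one date partition that would have to be loaded (or reloaded) in the second phase.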

I believe that you can, however, overwrite a partition, so at least you would only need to reload the modified partition.
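As a rough model of that idea, the sketch below rebuilds only the days that received late records and recomputes the daily maximum for those days alone. It is plain Python standing in for the table, not a Hive API: the dictionary `table`, the function `overwrite_touched_partitions`, and the sample values are hypothetical.

```python
def overwrite_touched_partitions(table, late_records_by_day):
    """Rebuild only the partitions that received late records.

    table: dict mapping a day string to a list of (timestamp, temperature).
    late_records_by_day: dict with the same shape, holding the late rows.
    Returns the recomputed daily maximum for each touched day.
    """
    daily_max = {}
    for day, late_rows in late_records_by_day.items():
        merged = sorted(table.get(day, []) + late_rows)
        # Replace the whole partition in one step, mimicking an overwrite.
        table[day] = merged
        daily_max[day] = max(temp for _, temp in merged)
    return daily_max


table = {
    "2023-11-13": [(1699880000, 18.2)],
    "2023-11-14": [(1700000000, 21.5)],
}
late = {"2023-11-13": [(1699900000, 19.0)]}
recomputed = overwrite_touched_partitions(table, late)
```

Here only the 2023-11-13 partition is rewritten and re-aggregated; the 2023-11-14 partition is left untouched, which is what keeps the cost proportional to the number of affected days rather than to the whole table.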

At the moment, at least, Hive is a batch-oriented system.

Tags: hadoop, hive

jfu