I have been trying to generate unique IDs for each row of a table (30 million+ rows).
There has to be a simple way to generate row IDs, and I was wondering if anyone has a solution.
Version 1 UUIDs, sometimes called time-based UUIDs, are generated using a combination of datetime values (reflecting the time the UUID is generated), a random value, and part of the MAC address of the device generating the UUID.
When selecting a set of records from a large Hive table, a unique key needs to be created for each record. In a sequential mode of operation, it is easy to generate a unique ID by calling something like max(id).
UUIDs are 16-byte (128-bit) numbers used to uniquely identify records.
Hive has no enforced primary key concept: it is not a traditional database, and its operations are file-based rather than record-based.
Use the reflect UDF to generate UUIDs.
reflect("java.util.UUID", "randomUUID")
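As a minimal sketch of how that call might sit in a query, assuming a hypothetical source table named events (the table name is not from the original post), the UDF can be placed in the select list so it is evaluated for each row:

```sql
-- Hypothetical table name (events), used for illustration only.
-- reflect() invokes the static Java method java.util.UUID.randomUUID()
-- for each row, so every row gets its own random UUID string.
SELECT
  reflect("java.util.UUID", "randomUUID") AS row_id,
  e.*
FROM events e;
```

Because the value is random, re-running the query produces different IDs, so write the result to a table if the IDs need to stay stable.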
Update (2019)
For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF that generates unique values and is far more performant than UUID strings. Documentation is still a bit sparse, but here is one example:
create table customer (
id bigint default surrogate_key(),
name string,
city string,
primary key (id) disable novalidate
);
To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:
-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;
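One way to sanity-check the load, sketched here against the customer table from the example above, is to compare the total row count with the number of distinct IDs:

```sql
-- Both counts should match if every inserted row received a distinct id.
select count(*) as total_rows,
       count(distinct id) as distinct_ids
from customer;
```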