Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

generating unique ids in hive

I have been trying to generate unique ids for each row of a table (30 million+ rows).

  • using sequential numbers obviously not does not work due to the parallel nature of Hadoop.
  • the built in UDFs rand() and hash(rand(),unixtime()) seem to generate collisions.

There has to be a simple way to generate row ids, and I was wondering of anyone has a solution.

  • my next step is just creating a Java map reduce job to generate a real hash string with a secure random + host IP + current time as a seed. but I figure I'd ask here before doing it ;)
like image 909
user1745713 Avatar asked Aug 15 '13 17:08

user1745713


People also ask

How are unique IDs generated?

Sometimes called time-based UUIDs, these IDs are generated using a combination of datetime values (reflecting the time the UUID is being generated), a random value, and a part of the MAC address of the device generating the UUID.

How do I create a unique key in hive?

While selecting a set of records from a big data hive table, a unique key needs to be created for each record. In a sequential mode of operation , it is easy to generate unique id by calling soem thing like max(id).

What is UUID in hive?

UUIDs are 16-byte (128-bit) numbers used to uniquely identify records.

Can we create primary key in hive?

There is no primary key concept in Hive as it is not a database and in hive operation is file based not the record based.


1 Answers

Use the reflect UDF to generate UUIDs.

reflect("java.util.UUID", "randomUUID")

Update (2019)

For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:

create table customer (
  id bigint default surrogate_key(),
  name string, 
  city string, 
  primary key (id) disable novalidate
);

To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:

-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;
like image 173
Carter Shanklin Avatar answered Oct 15 '22 07:10

Carter Shanklin