I have been trying to generate unique ids for each row of a table (30 million+ rows). <ul> <li>using sequential numbers obviously not does not work due to the parallel nature of Hadoop.</li> <li>the built in UDFs rand() and hash(rand(),unixtime()) seem to generate collisions.</li> </ul> There has to be a simple way to generate row ids, and I was wondering of anyone has a solution. <ul> <li>my next step is just creating a Java map reduce job to generate a real hash string with a secure random + host IP + current time as a seed. but I figure I'd ask here before doing it ;)</li> </ul>

Use the reflect UDF to generate UUIDs. <pre class="prettyprint"><code>reflect("java.util.UUID", "randomUUID") </code></pre> Update (2019) For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example: <pre class="prettyprint"><code>create table customer ( id bigint default surrogate_key(), name string, city string, primary key (id) disable novalidate ); </code></pre> To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column: <pre class="prettyprint"><code>-- staging_table would have two string columns. insert into customer (name, city) select * from staging_table; </code></pre>

generating unique ids in hive

1 Answers

Use the reflect UDF to generate UUIDs.

reflect("java.util.UUID", "randomUUID")

Update (2019)

For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:

create table customer (
  id bigint default surrogate_key(),
  name string, 
  city string, 
  primary key (id) disable novalidate
);

To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:

-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;

173

answered Oct 15 '22 07:10

Carter Shanklin

Related questions
                            
                                what are the replacement for hadoop Job deprecated class
                            
                                Hadoop WordCount example stuck at map 100% reduce 0%
                            
                                How do I delete files in hdfs directory after reading it using scala?
                            
                                Small files and HDFS blocks
                            
                                How to run a jar file in hadoop?
                            
                                First hadoop project error: "Input path does not exist"
                            
                                Edge nodes in hadoop cluster
                            
                                How to start learning hadoop [closed]
                            
                                Class Not Found Exception in Mapreduce wordcount job
                            
                                Understanding LongWritable
                            
                                Apache Spark error while start
                            
                                How to kill an application from the ResourceManager Web UI
                            
                                SafeModeException : Name node is in safe mode
                            
                                How the data is split in Hadoop
                            
                                Hadoop: key and value are tab separated in the output file. how to do it semicolon-separated?
                            
                                hadoop java.io.IOException: while running namenode -format
                            
                                hdfs - ls: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException:
                            
                                How do I install protobuf 2.5 on Arch Linux for compiling hadoop 2.6.0 using maven 3.3.1?
                            
                                Hadoop-2.2.0 "It looks like you are making an HTTP request to a Hadoop IPC port. "
                            
                                Is it worth purchasing Mahout in Action to get up to speed with Mahout, or are there other better sources?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

generating unique ids in hive

Tags:

identifier

hash

hadoop

hive

user1745713

People also ask

1 Answers

Carter Shanklin

Recent Activity

Donate For Us