Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

hive auto increment after certain number

Tags:

hive

I have a to insert data into a target table where all columns should be populated from different source tables except the surrogate key column; which should be maximum value of the target table plus auto increment value starting 1. I can generate auto increment value by using row_number() function, but in the same query how should I get the max value of surrogate key from target table. Is there any concept in HIVE where I can select the max value of surrogate key and save it in a temporary variable? Or is there any other simple way to achieve this result?

like image 624
Koushik Chandra Avatar asked Aug 15 '16 05:08

Koushik Chandra


1 Answers

Here are two approaches which worked for me for the above problem. ( explained with example)

Approach 1: getting the max and setting to hive commands through ${hiveconf} variable using shell script

Approach 2: using row_sequence(), max() and join operations

My Environment:

hadoop-2.6.0
apache-hive-2.0.0-bin

Steps: (note: step 1 and step 2 are common for both approaches. Starting from step 3 , it differs for both)

Step 1: create source and target tables

source

hive>create table source_table1(string name);
hive>create table source_table2(string name);
hive>create table source_table2(string name);

target

hive>create table target_table(int id,string name);

Step 2: load data into source tables

hive>load data local inpath 'source_table1.txt' into table source_table1;
hive>load data local inpath 'source_table2.txt' into table source_table2;
hive>load data local inpath 'source_table3.txt' into table source_table3;

Sample Input:

source_table1.txt

a
b
c

source_table2.txt

d
e
f

source_table3.txt

g
h
i

Approach 1:

Step 3: create a shell script hive_auto_increment.sh

#!/bin/sh
hive -e 'select max(id) from target_table' > max.txt
wait
value=`cat max.txt`
hive --hiveconf mx=$value -e "add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
set mx;
set hiveconf:mx;
INSERT INTO TABLE target_table SELECT row_sequence(),name from source_table1;
INSERT INTO TABLE target_table SELECT (\${hiveconf:mx} +row_sequence()),name from source_table2;
INSERT INTO TABLE target_table SELECT (\${hiveconf:mx} +row_sequence()),name from source_table3;"
wait
hive -e "select * from target_table;"

Step 4: run the shell script

 > bash  hive_auto_increment.sh 

Approach 2:

Step 3: Add Jar

    hive>add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;

Step 4: register row_sequence function with help of hive contrib jar

hive>create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

Step 5: load the source_table1 to target_table

hive>INSERT INTO TABLE target_table select row_sequence(),name from source_table1;

Step 6: load the other sources to target_table

  hive>INSERT INTO TABLE target_table SELECT M.rowcount+row_sequence(),T.name from source_table2 T join (select max(id) as rowcount from target_table) M;

  hive>INSERT INTO TABLE target_table SELECT  M.rowcount+row_sequence(),T.name from source_table3 T join (select max(id) as rowcount from target_table) M;

output:

INFO  : OK
+---------------+-----------------+--+
| target_table.id  | target_table.name  
+---------------+-----------------+--+
| 1                | a               |
| 2                | b               |
| 3                | c               |
| 4                | d               |
| 5                | e               |
| 6                | f               |
| 7                | g               |
| 8                | h               |
| 9                | i               |
like image 190
Aditya Avatar answered Oct 13 '22 00:10

Aditya