
How to map column names and replace them with new names when inserting into a Hive table

I have CSV data that arrives every 10 minutes in the format below. I need to insert it into Hive, mapping the incoming column names to different column names. The columns do not arrive in a constant order, and although there are 10 columns in total, many are sometimes missing, as in the example below. Sample CSV file:

1  2  6  4
u  f  b  h
a  f  r  m
q  r  b  c

While inserting into Hive, I need to replace the column names, for example:

1 -> NBR
2 -> GMB
3 -> GSB
4 -> KTC
5 -> VRV
6 -> AMB

The data should then be inserted into the Hive table as below:

NBR  GMB  GSB   KTC  VRV   AMB
u    f    NULL  h    NULL  b
a    f    NULL  m    NULL  r

Can anyone help me with how to insert these values into Hive?

Rahul Varma asked Jul 16 '19 05:07


2 Answers

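Assuming you can get column headers in your source CSV, you will need to map them from their source numbers to the table's column names. Restrict the substitution to the header line and match whole numbers only, otherwise data values and headers such as 10 get rewritten too (\b is GNU sed's word-boundary escape):

sed -i '1{s/\b1\b/NBR/g; s/\b2\b/GMB/g; s/\b3\b/GSB/g; s/\b4\b/KTC/g; s/\b5\b/VRV/g; s/\b6\b/AMB/g;...;...;...;...}' input.csv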
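
The same header renaming can be sketched in Python, touching only the first line and replacing whole column tokens rather than individual characters. This is a minimal sketch: the function name rename_header is hypothetical, and the mapping holds only the six names given in the question, since the remaining columns are not named there.

```python
# Mapping taken from the question; the table's other columns are not
# named there, so they are deliberately left out of this sketch.
COLUMN_NAMES = {"1": "NBR", "2": "GMB", "3": "GSB", "4": "KTC", "5": "VRV", "6": "AMB"}


def rename_header(csv_text, mapping=COLUMN_NAMES, sep=","):
    """Rename header tokens on the first line only; data lines are untouched."""
    header, newline, rest = csv_text.partition("\n")
    # Whole-token lookup: an unmapped header such as "10" is kept as-is
    # instead of being partially rewritten.
    renamed = [mapping.get(token.strip(), token.strip()) for token in header.split(sep)]
    return sep.join(renamed) + newline + rest


print(rename_header("1,2,6,4\nu,f,b,h\n1,2,3,4\n"), end="")
```

Here the data row 1,2,3,4 passes through unchanged, which a file-wide character substitution would not guarantee.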

Since you only get an unknown subset of the total columns in your hive table, you will need to translate your CSV from

NBR,GMB,AMB,KTC
u,f,b,h
a,f,r,m
q,r,b,c

to

NBR,GMB,GSB,KTC,VRV,AMB,...,...,...,...
u,f,null,h,null,b,null,null,null,null
a,f,null,m,null,r,null,null,null,null
q,r,null,c,null,b,null,null,null,null

in order to properly insert them into your table.
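
The realignment of a single row can be sketched as follows: pair each value with its column name, then read the values back out in table order, filling the gaps. The name realign_row is hypothetical, and TARGET_COLUMNS lists only the six columns named in the question, not the full ten.

```python
# Only the six column names given in the question; the real table has more.
TARGET_COLUMNS = ["NBR", "GMB", "GSB", "KTC", "VRV", "AMB"]


def realign_row(header, row, target=TARGET_COLUMNS, null="null"):
    """Place each value under its named column; fill absent columns with `null`."""
    by_name = dict(zip(header, row))
    return [by_name.get(column, null) for column in target]


header = ["NBR", "GMB", "AMB", "KTC"]
for row in (["u", "f", "b", "h"], ["a", "f", "r", "m"], ["q", "r", "b", "c"]):
    print(",".join(realign_row(header, row)))
```

Because the lookup is by name, the order in which the source columns arrive does not matter.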

From the Apache Wiki:

Values must be provided for every column in the table. The standard SQL syntax that allows the user to insert values into only some columns is not yet supported. To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.

Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal

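LOAD DATA INPATH, even with tblproperties("skip.header.line.count"="1") set, does not match fields to columns by header name: the file is moved into the table as-is and its fields are read by position, so every row still needs a value (or a null marker) for every column, in table order. This is why you're missing columns.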

If you cannot get the producer of the CSV to create a file with all 10 columns in the same order as your table columns, with either consecutive commas or a null marker for missing values, write a script that adds the missing column names in the order you need and fills in the required null values in the data.
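
One possible shape for such a script, using Python's csv module, is sketched below. It assumes a comma-separated input with a numeric header line, writes data rows only (no header), and uses \N as the null marker, which is the default null representation for Hive text tables. The function name normalize and the six-column TABLE_ORDER are assumptions for illustration; the real table has ten columns whose remaining names are not given.

```python
import csv
import io

# Names from the question; the table's remaining columns are not named
# there, so this sketch stops at six.
COLUMN_NAMES = {"1": "NBR", "2": "GMB", "3": "GSB", "4": "KTC", "5": "VRV", "6": "AMB"}
TABLE_ORDER = ["NBR", "GMB", "GSB", "KTC", "VRV", "AMB"]


def normalize(source, sink, null=r"\N"):
    """Read a headered CSV in any column order; write rows in table order."""
    reader = csv.reader(source)
    header = [COLUMN_NAMES[token.strip()] for token in next(reader)]
    # restval supplies the null marker for every column missing from the file.
    writer = csv.DictWriter(sink, fieldnames=TABLE_ORDER, restval=null, lineterminator="\n")
    for row in reader:
        if not row:  # skip blank lines
            continue
        writer.writerow(dict(zip(header, row)))


out = io.StringIO()
normalize(io.StringIO("1,2,6,4\nu,f,b,h\na,f,r,m\n"), out)
print(out.getvalue(), end="")
```

In practice source and sink would be open file handles; an unknown header token raises KeyError here, which seems preferable to silently loading a misaligned file.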

Chris Marotta answered Oct 01 '22 16:10


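If the CSV has a header like 1,2,3,4 (as you wrote in the comment), you can use the following syntax, which names only the target columns you want to populate (column lists in INSERT are supported since Hive 1.2.0). Quote numeric source column names with backticks, otherwise Hive reads them as integer literals; target_table is a placeholder for your table name:

insert into table target_table (columns you want to populate) select `1`, `2`, `3`, `4` from csv_table;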

So, if you know the order of the CSV columns, you can easily write the insert, naming only the columns that you need to populate, regardless of their order in the target table.

Before you can run the above insert, you must create a table that reads from the CSV.
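
Since the column order changes from file to file, the insert statement could be generated from each file's header once such a staging table exists. The sketch below only builds the statement text. It assumes the staging table's columns are named after the header tokens (1, 2, ...) and a Hive version that accepts a column list in INSERT; build_insert and target_table are hypothetical names, and the mapping covers only the six columns named in the question.

```python
# Mapping from the question; the remaining columns are not named there.
COLUMN_NAMES = {"1": "NBR", "2": "GMB", "3": "GSB", "4": "KTC", "5": "VRV", "6": "AMB"}


def build_insert(csv_header, target_table="target_table", staging_table="csv_table"):
    """Build an INSERT that names only the target columns present in this file."""
    source_columns = [token.strip() for token in csv_header.split(",")]
    target_columns = [COLUMN_NAMES[column] for column in source_columns]
    # Backticks make Hive treat `1` as a column name rather than the integer 1.
    select_list = ", ".join(f"`{column}`" for column in source_columns)
    return (
        f"INSERT INTO TABLE {target_table} ({', '.join(target_columns)}) "
        f"SELECT {select_list} FROM {staging_table}"
    )


print(build_insert("1,2,6,4"))
```

Target columns left out of the list are populated with NULL, which matches the output the question asks for.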

F.Lazarescu answered Oct 01 '22 17:10