I have a csv data as below where data comes every 10mins in the following format. I need to insert this data into hive by mapping column names with different column names. (columns don't come in constant order they change their order, we have total 10 columns sometimes we miss many columns like one example below below) sample csv file :-
1 2 6 4
u f b h
a f r m
q r b c
now while inserting into hive i need to replace column names for example
1 -> NBR
2 -> GMB
3 -> GSB
4 -> KTC
5 -> VRV
6 -> AMB
now I need to insert into hive table as below
NBR GMB GSB KTC VRV AMB
u f NULL h NULL b
a f NULL m NULL r
can anyone help me with this how to insert this values into hive
Assuming you can get column headers in you source CSV, you will need to map them from source number to their column names.
sed -i 's/1/NBR/g; s/2/GMB/g; s/3/GSB/g; s/4/KTC/g; s/5/VRV/g; s/6/AMB/g;...;...;...;...' input.csv
Since you only get an unknown subset of the total columns in your hive table, you will need to translate your CSV from
NBR,GMB,AMB,KTC
u,f,b,h
a,f,r,m
q,r,b,c
to
NBR,GMB,GSB,KTC,VRV,AMB,...,...,...,...
u,f,null,b,null,h,null,null,null,null
a,f,null,r,null,m,null,null,null,null
q,r,null,b,null,c,null,null,null,null
in order to properly insert them into your table.
From the Apache Wiki:
Values must be provided for every column in the table. The standard SQL syntax that allows the user to insert values into only some columns is not yet supported. To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.
Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal
Using LOAD DATA INPATH
, even with the tblproperties("skip.header.line.count"="1")
set, still requires a valid SQL literal for all columns in the table. This is why youre missing columns.
If you can not get the producer of the CSV to create a file with 1,2,...9,10 columns in order with your table columns and either consecutive commas or a null
character in the data, write some kind of script to add missing column names, in the order you need them in, and the required null
values in the data.
If you will have header in csv like 1,2,3,4 (as you wrote in the comment), you could use the next syntax:
insert into table (columns where you want to insert) select 1,2,3,4 (columns) from csv_table;
So, if you could know the order of csv columns, you could write easily the insert, naming only the column that you need to populate, no matter the order in the target table.
Before you could run the above insert, you should create a table that reads from csv!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With