Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the best way to support array column types with external tables in hive?

Tags:

hadoop

hive

So i have external tables of tab delimited data. A simple table looks like this:

create external table if not exists categories
(id string, tag string, legid string, image string, parent string, created_date string, time_stamp int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://somewhere/';

Now I'm adding another field to the end, it will be a comma separated list of values.

Is there a way to specify this in the same way that I specify a field terminator, or do I have to rely on one of the serdes?

eg:

...list_of_names ARRAY<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ARRAY ELEMENTS SEPARATED BY ','
...

(I'm assuming I'll need to use a serde for this, but I figured there wasn't any harm in asking)

like image 966
Matthew Rathbone Avatar asked Jun 07 '11 18:06

Matthew Rathbone


People also ask

Can we add column in external table in Hive?

So adding new columns into a table is a relatively cheap metadata-only operation as Hive does not modify the existing data files. Then when you retrieve data from the table Hive sets NULL values for columns that do not exist in old data files.

What happen if we create table external with data in Hive warehouse location?

When you create external table with out location , the data will be stored in the hive default location. Usually /apps/hive/warehouse/<database_name>. db/<table_name> . If you drop the table , you can find the data in this location.

Which type of table is preferable to store the table in a specific location in Hive?

one would prefer to choose external tables if the data is stored outside HDFS like S3.

Where Hive external tables store data?

An external table is stored on HDFS or any storage compatible with HDFS, because we want to use the data outside of Hive. Thus, Hive is not responsible for managing the storage of the external table. Tables can be stored on an external location for instance on a cloud platform like google cloud or AWS.


1 Answers

I don't know how to update an existing table to do that, but for creating a table; what you are looking for can be found, in depth, at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL. A snippet from there

row_format
  : DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

An example from our table creation is

CREATE TABLE IF NOT EXISTS visits
(
    ... Columns Removed...
)
    PARTITIONED BY (userdate STRING)
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
    STORED AS TEXTFILE
;

The line from that you'd be looking for is the COLLECTION ITEMS TERMINATED BY char for an array.

hth

like image 198
QuinnG Avatar answered Sep 27 '22 19:09

QuinnG