
Extracting an Array of Structs in Hive

I have an external table in Hive:

CREATE EXTERNAL TABLE FOO (  
  TS string,  
  customerId string,  
  products array< struct <productCategory:string, productId:string> >  
)  
PARTITIONED BY (ds string)  
ROW FORMAT SERDE 'some.serde'  
WITH SERDEPROPERTIES ('error.ignore'='true')  
LOCATION 'some_locations'  
;

A record of the table may hold data such as:

1340321132000, 'some_company', [{"productCategory":"footwear","productId":"nik3756"},{"productCategory":"eyewear","productId":"oak2449"}]

Does anyone know if there is a way to extract every productCategory from this record and return them as an array of product categories, without using explode? Something like the following:

["footwear", "eyewear"] 

Or do I need to write my own GenericUDF? If so, I do not know much Java (I am a Ruby person), so can someone give me some hints? I read some of the Apache Hive instructions on UDFs, but I do not know which collection type is best for handling arrays, or which type to use for structs.

===

I have partly answered this question myself by writing a GenericUDF, but I ran into two other problems. They are described in this SO question.

pchu asked Mar 26 '13 03:03



2 Answers

You can use a JSON SerDe or the built-in functions get_json_object and json_tuple.
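As a rough sketch of the built-in route, assume the raw JSON is stored one document per row in a single string column. The table raw_json and its column line are hypothetical names, and the keys DocId and Orders come from the sample document used in this answer.

```sql
-- Hypothetical table: raw_json(line string), one JSON document per row.

-- get_json_object takes a JSONPath-like expression and returns a string.
SELECT get_json_object(line, '$.DocId')            AS doc_id,
       get_json_object(line, '$.Orders[0].ItemId') AS first_item_id
FROM raw_json;

-- json_tuple extracts several top-level keys in one pass. It is a
-- table-generating function, so it is used through LATERAL VIEW.
SELECT t.doc_id, t.orders
FROM raw_json
LATERAL VIEW json_tuple(line, 'DocId', 'Orders') t AS doc_id, orders;
```

Both functions return strings, so orders here is JSON text rather than a Hive array. That is the main reason a SerDe is more convenient for nested data.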

With rcongiu's Hive-JSON SerDe the usage will be:

Define the table:

CREATE TABLE complex_json (
DocId string,
Orders array<struct<ItemId:int, OrderDate:string>>)

Load sample JSON into it (each JSON record must sit on a single line):

{"DocId":"ABC","Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}

Then fetching the item IDs of the orders is as easy as:

SELECT Orders.ItemId FROM complex_json LIMIT 100;

It will return the list of ids for you:

itemid
[1111,2222]

This returned correct results in my environment. Full listing:

add jar hdfs:///tmp/json-serde-1.3.6.jar;

CREATE TABLE complex_json (
  DocId string,
  Orders array<struct<ItemId:int, OrderDate:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA INPATH '/tmp/test.json' OVERWRITE INTO TABLE complex_json;

SELECT Orders.ItemId FROM complex_json LIMIT 100;
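Applied to the table from the question, the same field access on an array of structs would look like the sketch below. It assumes the SerDe behind FOO exposes products as a real array<struct<...>> column, as declared in the DDL. The partition value is a made-up placeholder.

```sql
-- Selecting a struct field through an array-of-structs column yields
-- an array of that field's values, one entry per struct.
SELECT customerId,
       products.productCategory AS productCategories
FROM FOO
WHERE ds = '2013-03-26';  -- hypothetical partition value
```

For the sample record in the question, this should give something like ["footwear","eyewear"], with no explode and no UDF.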

Read more here:

http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html

Viktor answered Nov 02 '22 17:11



One way would be to use either the inline or explode functions, like so:

SELECT
    TS,
    customerId,
    pCat,
    pId
FROM FOO
LATERAL VIEW inline(products) p AS pCat, pId;
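The explode form is mentioned but not shown, so here is a minimal sketch of it. explode emits one row per array element and leaves each struct intact. The aliases t and p are arbitrary.

```sql
-- explode produces one row per element of products; p is the struct.
SELECT TS,
       customerId,
       p.productCategory,
       p.productId
FROM FOO
LATERAL VIEW explode(products) t AS p;
```

inline goes one step further and unpacks the struct fields into separate columns, which is why the inline query names two output columns.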

Otherwise, you can write a UDF. Check out this post and this post for that, along with the following resources:

  • Matthew Rathbone's guide to writing generic UDFs
  • Mark Grover's how-to guide
  • The Baynote blog post on generic UDFs
dstandish answered Nov 02 '22 16:11
