Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to query struct array with Hive (get_json_object) or json serde

I am trying to query the following JSON example file stored on my HDFS

{
    "tag1": "1.0",
    "tag2": "blah",
    "tag3": "blahblah",
    "tag4": {
        "tag4_1": [{
                "tag4_1_1": [{
                        "tag4_1_1_1": {
                            "Addr": {
                                "Addr1": "blah",
                                "City": "City",
                                "StateProvCd": "NY",
                                "PostalCode": "99999"
                            }
                        }
                        "tag4_1_1_1": {
                            "Addr": {
                                "Addr1": "blah2",
                                "City": "City2",
                                "StateProvCd": "NY",
                                "PostalCode": "99999"
                            }
                        }
                    }
                ]
            }
        ]
    }
}

I used the following to create an external table over the data

CREATE  EXTERNAL TABLE DB.hv_table
(
  tag1 string
, tag2 string
, tag3 string
, tag4 struct<tag4_1:ARRAY<struct<tag4_1_1:ARRAY<struct<tag4_1_1_1:struct<Addr
                Addr1:string
                , City:string
                , StateProvCd:string
                , PostalCode:string>>>>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' 
LOCATION 'HDFS/location';

Ideally, I want to query the data such that it would return to me as such:

select tag1, tag2, tag3, tag4(all data) from DB.hv_table;

Can someone provide me an example of how I can query without writing it in the following manner:

select tag1, tag2, tag3
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.Addr1 as Addr1 
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.City as City 
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.StateProvCd as StateProvCd 
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.PostalCode as PostalCode 
from DB.hv_table

Most importantly, I would like to not define the array item element number. In my example, I am only able to target the first element of my array (tag4_1_1_1). I would to target everything if possible.

like image 550
DatWunGuy102 Avatar asked Jul 10 '17 19:07

DatWunGuy102


People also ask

How do you access a struct in hive?

Structs are written in JSON format. You can access the values using the dot notation for the field to extact the value.

What is JSON SerDe in hive?

The Hive JSON SerDe is commonly used to process JSON data like events. These events are represented as single-line strings of JSON-encoded text separated by a new line. The Hive JSON SerDe does not allow duplicate keys in map or struct key names.

Does Hive Support JSON?

Use the get_json_object UDFHive provides a built-in UDF called get_json_object that queries JSON during runtime. This method takes two arguments: the table name and method name. The method name has the flattened JSON document and the JSON field that needs to be parsed.


1 Answers

Found a really good blog at: ThornyDev

CREATE EXTERNAL TABLE IF NOT EXISTS DB.dummyTable (jsonBlob STRING)
LOCATION 'pathOfYourFiles';

SELECT 
get_json_object(jsonBlob, '$.tag1') AS tag1
,get_json_object(jsonBlob, '$.tag2') AS tag2
,get_json_object(jsonBlob, '$.tag3') AS tag3
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.Addr1') AS Addr1
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.City') AS City
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.StateProvCd') AS StateProvCd
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.PostalCode') AS PostalCode
FROM DB.dummyTable

I'm very satisfied, but I want to check out the json tuple and see how it performs versus the "get_json_object" class

like image 65
DatWunGuy102 Avatar answered Sep 25 '22 14:09

DatWunGuy102