I am trying to query the following JSON example file stored on my HDFS
{
"tag1": "1.0",
"tag2": "blah",
"tag3": "blahblah",
"tag4": {
"tag4_1": [{
"tag4_1_1": [{
"tag4_1_1_1": {
"Addr": {
"Addr1": "blah",
"City": "City",
"StateProvCd": "NY",
"PostalCode": "99999"
}
}
"tag4_1_1_1": {
"Addr": {
"Addr1": "blah2",
"City": "City2",
"StateProvCd": "NY",
"PostalCode": "99999"
}
}
}
]
}
]
}
}
I used the following to create an external table over the data
CREATE EXTERNAL TABLE DB.hv_table
(
tag1 string
, tag2 string
, tag3 string
, tag4 struct<tag4_1:ARRAY<struct<tag4_1_1:ARRAY<struct<tag4_1_1_1:struct<Addr
Addr1:string
, City:string
, StateProvCd:string
, PostalCode:string>>>>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 'HDFS/location';
Ideally, I want to query the data such that it would return to me as such:
select tag1, tag2, tag3, tag4(all data) from DB.hv_table;
Can someone provide me an example of how I can query without writing it in the following manner:
select tag1, tag2, tag3
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.Addr1 as Addr1
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.City as City
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.StateProvCd as StateProvCd
, tag4.tag4_1[0].tag4_1_1[0].tag4_1_1_1.Addr.PostalCode as PostalCode
from DB.hv_table
Most importantly, I would like to not define the array item element number. In my example, I am only able to target the first element of my array (tag4_1_1_1). I would to target everything if possible.
Structs are written in JSON format. You can access the values using the dot notation for the field to extact the value.
The Hive JSON SerDe is commonly used to process JSON data like events. These events are represented as single-line strings of JSON-encoded text separated by a new line. The Hive JSON SerDe does not allow duplicate keys in map or struct key names.
Use the get_json_object UDFHive provides a built-in UDF called get_json_object that queries JSON during runtime. This method takes two arguments: the table name and method name. The method name has the flattened JSON document and the JSON field that needs to be parsed.
Found a really good blog at: ThornyDev
CREATE EXTERNAL TABLE IF NOT EXISTS DB.dummyTable (jsonBlob STRING)
LOCATION 'pathOfYourFiles';
SELECT
get_json_object(jsonBlob, '$.tag1') AS tag1
,get_json_object(jsonBlob, '$.tag2') AS tag2
,get_json_object(jsonBlob, '$.tag3') AS tag3
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.Addr1') AS Addr1
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.City') AS City
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.StateProvCd') AS StateProvCd
,get_json_object(jsonBlob, '$.tag4.tag4_1.tag4_1_1.tag4_1_1_1.Addr.PostalCode') AS PostalCode
FROM DB.dummyTable
I'm very satisfied, but I want to check out the json tuple and see how it performs versus the "get_json_object" class
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With