
How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some nested JSON data and run queries on it. Is this even possible?

I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to turn the JSON file into a Hive table.

Does anyone have an example command to get me started? I can't find anything useful with Google.

nickponline asked Jul 13 '12 22:07




2 Answers

It's actually not necessary to use the JSON SerDe. There is a great blog post here (I'm not affiliated with the author in any way):

http://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/

It outlines a strategy that uses the built-in function json_tuple to parse the JSON at query time (NOT at table-definition time):

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple

So basically, your table schema just loads each line into a single 'string' column, and you then extract the relevant JSON fields as needed on a per-query basis, e.g. with this query from that blog post:

SELECT b.blogID, c.email
FROM comments a
LATERAL VIEW json_tuple(a.value, 'blogID', 'contact') b AS blogID, contact
LATERAL VIEW json_tuple(b.contact, 'email', 'website') c AS email, website
WHERE b.blogID='64FY4D0B28';

In my humble experience, this has proven more reliable (I encountered various cryptic issues dealing with the JSON serdes, especially with nested objects).
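
For completeness, here is a minimal sketch of the table definition this approach assumes. The table name comments and the column name value come from the query above; the S3 location is only a placeholder, and each JSON object is assumed to sit on its own line in the file:

```sql
-- One plain string column; every line of the file becomes one row.
-- The location below is hypothetical; point it at the S3 folder holding your JSON.
CREATE EXTERNAL TABLE comments (
  value STRING
)
LOCATION 's3://my.bucket/comments/';

-- For a single nested field, the built-in get_json_object is an alternative
-- to json_tuple (the field names mirror the query above).
SELECT get_json_object(a.value, '$.contact.email') AS email
FROM comments a
WHERE get_json_object(a.value, '$.blogID') = '64FY4D0B28';
```

Because the JSON is parsed only when a query runs, fields can be added to or dropped from the data without changing the table definition.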

Mike Repass answered Oct 02 '22 18:10


You'll need to use a JSON serde in order for Hive to map your JSON to the columns in your table.

A really good example showing you how is here:

http://aws.amazon.com/articles/2855

Unfortunately, the supplied JSON serde doesn't handle nested JSON very well, so you might need to flatten your JSON in order to use it.

Here's an example of the correct syntax from the article:

create external table impressions (
    requestBeginTime string, requestEndTime string, hostname string
  )
  partitioned by (
    dt string
  )
  row format
    serde 'com.amazon.elasticmapreduce.JsonSerde'
    with serdeproperties (
      'paths'='requestBeginTime, requestEndTime, hostname'
    )
  location 's3://my.bucket/' ;
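
One thing worth noting when trying this: because the table is partitioned, Hive returns no rows until the partitions are registered. A rough sketch, assuming the data for one day lives under a dt=... folder in the bucket (the date and the folder layout here are made up for illustration):

```sql
-- Register one partition explicitly; the path and date are hypothetical.
ALTER TABLE impressions ADD PARTITION (dt = '2012-07-13')
LOCATION 's3://my.bucket/dt=2012-07-13/';

-- Check which partitions Hive knows about, then query as usual.
SHOW PARTITIONS impressions;

SELECT hostname, requestBeginTime, requestEndTime
FROM impressions
WHERE dt = '2012-07-13'
LIMIT 10;
```

If Hive reports that the serde class cannot be found, its jar has to be loaded with ADD JAR before running the create statement.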
seedhead answered Oct 02 '22 19:10