 

How to load json snappy compressed in HIVE

I have a bunch of snappy-compressed JSON files in HDFS. They are Hadoop snappy compressed (not Python snappy, cf. other SO questions) and have nested structures.

I could not find a way to load them into Hive (e.g. using json_tuple).

Could I get some resources/hints on how to load them?

Previous references (which do not have valid answers):

pyspark how to load compressed snappy file

Hive: parsing JSON

asked Oct 14 '20 by tensor

2 Answers

  1. Put all the files in an HDFS folder and create an external table on top of it. If the files have a .snappy extension, Hive will automatically recognize them. You can also specify SNAPPY output format for writing to the table:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
 


CREATE EXTERNAL TABLE mydirectory_tbl(
  id   string,
  name string
)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' --this is HDFS/S3 location
;
  2. JsonSerDe can parse all complex structures; it is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as-is. Everything in square brackets [] becomes an array<>, and everything in curly braces {} becomes a struct<> or map<>; complex types can be nested. Carefully read the README: https://github.com/rcongiu/Hive-JSON-Serde. There is a section about nested structures and many CREATE TABLE examples.
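To make the nested-structure mapping concrete, here is a hedged sketch. The record shape, table name, and column names are assumptions for illustration only, not taken from the question:

```sql
-- Hypothetical nested record: {"id":"1","person":{"name":"a","tags":["x","y"]}}
CREATE EXTERNAL TABLE nested_tbl (
  id     string,
  person struct<name:string, tags:array<string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir';

-- Dotted notation reaches into the struct; the array can be exploded:
SELECT id, person.name, tag
FROM nested_tbl
LATERAL VIEW explode(person.tags) t AS tag;
```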

  3. If you still want to use json_tuple, create the table with a single STRING column and then parse it with json_tuple. But this is much more difficult.
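A sketch of what that json_tuple route looks like, assuming one JSON record per line; the table name and attribute names here are illustrative, not from the question:

```sql
-- Hypothetical staging table: each row holds one raw JSON record as a string
CREATE EXTERNAL TABLE json_raw (line STRING)
LOCATION '/mydir';

-- json_tuple extracts top-level attributes; nested objects come back
-- as JSON strings and need a second parsing pass
SELECT t.id, t.name
FROM json_raw j
LATERAL VIEW json_tuple(j.line, 'id', 'name') t AS id, name;
```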

  4. All JSON records should be on a single line (no newlines inside JSON objects, including \r). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde

answered Sep 28 '22 by leftjoin


If your data is partitioned (e.g. by date):

Create the table in Hive

CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
  filename STRING,
  cnt BIGINT,
  size DOUBLE
) PARTITIONED BY ( `date` STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs'

Recover the partitions (before the recovery, the table appears to be empty)

MSCK REPAIR TABLE database.table
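If MSCK REPAIR is slow on a large directory tree, a single partition can also be registered explicitly. This is a sketch; the date value and the path layout are assumptions:

```sql
ALTER TABLE database.table ADD IF NOT EXISTS
PARTITION (`date`='2020-10-14')
LOCATION 'folder/path/in/hdfs/date=2020-10-14';
```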

answered Sep 28 '22 by François B.