 

How to load json snappy compressed in HIVE

I have a bunch of snappy-compressed JSON files in HDFS. They are Hadoop snappy compressed (not Python snappy, cf. other SO questions) and have nested structures.

I could not find a way to load them into Hive (e.g. using json_tuple).

Could I get some resources/hints on how to load them?

Previous references (which do not have valid answers):

pyspark how to load compressed snappy file

Hive: parsing JSON

asked Oct 14 '20 by tensor

2 Answers

  1. Put all the files in an HDFS folder and create an external table on top of it. If the files have a .snappy extension, Hive will automatically recognize them. You can also specify SNAPPY output format for writing to the table:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
 


CREATE EXTERNAL TABLE mydirectory_tbl(
  id   string,
  name string
)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' --this is HDFS/S3 location
;
  2. JsonSerDe can parse all complex structures; it is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as-is. Everything in square brackets [] becomes an array<>, and everything in curly braces {} becomes a struct<> or map<>; complex types can be nested. Carefully read the README: https://github.com/rcongiu/Hive-JSON-Serde. There is a section about nested structures and many CREATE TABLE examples.
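To make the nested-structure mapping concrete, here is a hedged sketch. The record shape, table name, and column names are assumptions for illustration only, not taken from the question:

```sql
-- Hypothetical nested record: {"id":"1","person":{"name":"a","tags":["x","y"]}}
CREATE EXTERNAL TABLE nested_tbl (
  id     string,
  person struct<name:string, tags:array<string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir';

-- Dotted notation reaches into the struct; the array can be exploded:
SELECT id, person.name, tag
FROM nested_tbl
LATERAL VIEW explode(person.tags) t AS tag;
```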

  3. If you still want to use json_tuple, create the table with a single STRING column and then parse it with json_tuple. But this is much more difficult.
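A sketch of what that json_tuple route looks like, assuming one JSON record per line; the table name and attribute names here are illustrative, not from the question:

```sql
-- Hypothetical staging table: each row holds one raw JSON record as a string
CREATE EXTERNAL TABLE json_raw (line STRING)
LOCATION '/mydir';

-- json_tuple extracts top-level attributes; nested objects come back
-- as JSON strings and need a second parsing pass
SELECT t.id, t.name
FROM json_raw j
LATERAL VIEW json_tuple(j.line, 'id', 'name') t AS id, name;
```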

  4. All JSON records should be on a single line (no newlines inside JSON objects, including \r). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde

answered Sep 28 '22 by leftjoin


If your data is partitioned (e.g. by date):

Create the table in Hive

CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
  filename STRING,
  cnt BIGINT,
  size DOUBLE
) PARTITIONED BY ( `date` STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs'

Recover the partitions (before the recovery, the table appears to be empty)

MSCK REPAIR TABLE database.table
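If MSCK REPAIR is slow on a large directory tree, a single partition can also be registered explicitly. This is a sketch; the date value and the path layout are assumptions:

```sql
ALTER TABLE database.table ADD IF NOT EXISTS
PARTITION (`date`='2020-10-14')
LOCATION 'folder/path/in/hdfs/date=2020-10-14';
```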

answered Sep 28 '22 by François B.