 

Athena and S3 Inventory. HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema

I'm trying to understand how to work with s3 inventory. I'm following this tutorial

After loading inventory lists into my table I'm trying to query it and find two issues.

1) SELECT key, size FROM table; — the size column shows the same garbage value 4923069104295859283 for every record.

2) SELECT * FROM table; (Query Id: cf07c309-c685-4bf4-9705-8bca69b00b3c)

This query fails with the error:

HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema

Here is my table schema:

CREATE EXTERNAL TABLE `table`(
`bucket` string, 
`key` string, 
`version_id` string, 
`is_latest` boolean, 
`is_delete_marker` boolean, 
`size` bigint, 
`last_modified_date` timestamp, 
`e_tag` string, 
`storage_class` string)
PARTITIONED BY ( 
`dt` string)
ROW FORMAT SERDE 
'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://......../hive'
TBLPROPERTIES (
'transient_lastDdlTime'='1516093603')
asked Jan 16 '18 by duzvik

2 Answers

Running the following command on any of the ORC files from the inventory generated by AWS S3 will give you the actual structure of your inventory:

$> hive --orcfiledump /tmp/017c2014-1205-4431-a30d-2d9ae15492d6.orc
...
Processing data file /tmp/017c2014-1205-4431-a30d-2d9ae15492d6.orc [length: 4741786]
Structure for /tmp/017c2014-1205-4431-a30d-2d9ae15492d6.orc
File Version: 0.12 with ORC_135
Rows: 223473
Compression: ZLIB
Compression size: 262144
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string>
...
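To get hold of such an ORC file in the first place, you can list the inventory's data prefix and copy one object down. A hedged sketch — the data/ prefix follows the default inventory layout, the file name is just an example, and the bucket path is left elided as elsewhere in this page:

```
$> aws s3 ls s3://............/data/
$> aws s3 cp s3://............/data/017c2014-1205-4431-a30d-2d9ae15492d6.orc /tmp/
```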

It appears that the example provided by AWS here expects that your inventory covers all versions of the objects in your bucket, not just the current version.

The right table structure for Athena (for an encrypted bucket) is then:

CREATE EXTERNAL TABLE inventory(
  bucket string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://............/hive'
  TBLPROPERTIES ('has_encrypted_data'='true');
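After creating the table you still need to register each delivery's partition before Athena can see any data. A minimal sketch — the dt value is a hypothetical example, and the bucket path stays elided as in the DDL above; substitute the dt= folders that actually exist under your inventory's hive/ prefix:

```sql
-- Hypothetical partition value; list the hive/ prefix to find real ones.
ALTER TABLE inventory ADD PARTITION (dt = '2018-01-14-08-00')
  LOCATION 's3://............/hive/dt=2018-01-14-08-00/';
```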
answered Sep 20 '22 by herve

The error in our case was selecting 'Current version only' in the inventory configuration. The difference between the 'current' and 'all' versions configurations is the absence of these columns:

  • version_id
  • is_latest
  • is_delete_marker

Example hive output for ORC format:

> hive --orcfiledump ./inventoryexamplefilewithcurrentversiononly.orc
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string,object_lock_retain_until_date:timestamp,object_lock_mode:string,object_lock_legal_hold_status:string>

Create the table for the ORC format:

-- Create table IF USING 'CURRENT VERSION' only in S3 inventory config
CREATE EXTERNAL TABLE your_table_name(
  bucket string,
  key string,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date timestamp,
  object_lock_mode string,
  object_lock_legal_hold_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://.../hive/';
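Once the table and its partitions are in place, a quick sanity check is to confirm that size now comes back as a plausible byte count rather than the garbage value from the question. A sketch, assuming the table name from the DDL above and a hypothetical dt value:

```sql
-- Substitute a dt partition that actually exists in your inventory.
SELECT key, size, last_modified_date
FROM your_table_name
WHERE dt = '2022-09-18-00-00'
ORDER BY size DESC
LIMIT 10;
```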

Also, as an aside: if you're using SSE-KMS encryption (i.e. your own KMS key) rather than SSE-S3, you don't need 'has_encrypted_data' at all.

Note

With SSE-KMS, Athena does not require you to indicate that data is encrypted when creating a table.

Source: https://docs.aws.amazon.com/athena/latest/ug/encryption.html

answered Sep 18 '22 by BeepBoop