I'm trying to understand how to work with s3 inventory. I'm following this tutorial
After loading inventory lists into my table I'm trying to query it and find two issues.
1) SELECT key, size FROM table;
Size column for all records show a magic number(value) 4923069104295859283
2) select * from table;
Query Id: cf07c309-c685-4bf4-9705-8bca69b00b3c
.
Receiving error:
HIVE_BAD_DATA: Field size's type LONG in ORC is incompatible with type varchar defined in table schema
Here is my table schema:
CREATE EXTERNAL TABLE `table`(
`bucket` string,
`key` string,
`version_id` string,
`is_latest` boolean,
`is_delete_marker` boolean,
`size` bigint,
`last_modified_date` timestamp,
`e_tag` string,
`storage_class` string)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://......../hive'
TBLPROPERTIES (
'transient_lastDdlTime'='1516093603')
The following command of any of your orc file coming from the inventory generated by AWS S3 will give you the actual structure of your inventory:
$> hive --orcfiledump ~/Downloads/017c2014-1205-4431-a30d-2d9ae15492d6.orc
...
Processing data file /tmp/017017c2014-1205-4431-a30d-2d9ae15492d6.orc [length: 4741786]
Structure for /mp/017c2014-1205-4431-a30d-2d9ae15492d6.orc
File Version: 0.12 with ORC_135
Rows: 223473
Compression: ZLIB
Compression size: 262144
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string>
...
It appears that the example provided by aws here expects that your inventory is not just for the current version
but for all versions
of objects in your bucket.
The right table structure for Athena
is then for an encrypted bucket:
CREATE EXTERNAL TABLE inventory(
bucket string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://............/hive'
TBLPROPERTIES ('has_encrypted_data'='true');
The error in our case was using 'Current Version' only in the configuration. The difference between 'current' and 'all' versions configuration is the absence of columns:
Example hive output for ORC format:
> hive --orcfiledump ./inventoryexamplefilewithcurrentversiononly.orc
Type: struct<bucket:string,key:string,size:bigint,last_modified_date:timestamp,e_tag:string,storage_class:string,is_multipart_uploaded:boolean,replication_status:string,encryption_status:string,object_lock_retain_until_date:timestamp,object_lock_mode:string,object_lock_legal_hold_status:string>
Create the table for the ORC format:
-- Create table IF USING 'CURRENT VERSION' only in S3 inventory config
CREATE EXTERNAL TABLE your_table_name(
`bucket` string,
key string,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string,
object_lock_retain_until_date timestamp,
object_lock_mode string,
object_lock_legal_hold_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://.../hive/';
Also, as an aside, if you're using "SSE-KMS" (i.e. your own KMS key) and not "SSE-S3" encryption, you don't need 'has_encrypted_data' configured.
Note
With SSE-KMS, Athena does not require you to indicate that data is encrypted when creating a table.
Source: https://docs.aws.amazon.com/athena/latest/ug/encryption.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With