I am trying to load data from Gzip archives into a Hive table, but my gzip files have extensions like, for example:
apache_log.gz_localhost
When I specify the HDFS directory where these files are located, Hive doesn't recognize them as GZip-compressed because it only looks for files with the .gz extension.
Is it possible to define the file type when loading data into Hive? Something like (pseudo):
set input.format=gzip;
LOAD DATA INPATH '/tmp/logs/' INTO TABLE apache_logs;
Here is my SQL for table creation:
CREATE EXTERNAL TABLE access_logs (
  `ip` STRING,
  `time_local` STRING,
  `method` STRING,
  `request_uri` STRING,
  `protocol` STRING,
  `status` STRING,
  `bytes_sent` STRING,
  `referer` STRING,
  `useragent` STRING,
  `bytes_received` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\[]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+|\-) "([^"]+)" "([^"]+)".* (\\d+)'
)
STORED AS TEXTFILE
LOCATION '/tmp/logs/';
Why not rename the files to xxx.gz after putting them into HDFS?
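If you don't want to do that by hand, here is a minimal sketch that renames every matching file through the HDFS FileSystem API (the class name is made up for illustration and the /tmp/logs/ path is taken from your question; a plain hdfs dfs -mv per file works just as well):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Renames every *.gz_localhost file under /tmp/logs/ to *.gz so the
// standard GzipCodec picks them up. Class name and directory are assumptions.
public class RenameGzLogs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/tmp/logs/"))) {
            String name = status.getPath().getName();
            if (name.endsWith(".gz_localhost")) {
                // e.g. apache_log.gz_localhost -> apache_log.gz
                String newName = name.substring(0, name.length() - "_localhost".length());
                fs.rename(status.getPath(), new Path(status.getPath().getParent(), newName));
            }
        }
    }
}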
If you really want to keep the .gz_localhost extension, I think you can write your own GzipCodec to achieve it:
Create your own NewGzipCodec class which extends GzipCodec:
public class NewGzipCodec extends org.apache.hadoop.io.compress.GzipCodec { }
Override the method getDefaultExtension:
public String getDefaultExtension() { return ".gz_localhost"; }
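Put together, the whole class is a minimal wrapper; only the expected extension changes, everything else is inherited from GzipCodec:

import org.apache.hadoop.io.compress.GzipCodec;

// Same compression behaviour as the built-in GzipCodec; Hadoop will now
// treat files ending in .gz_localhost as gzip-compressed.
public class NewGzipCodec extends GzipCodec {

    @Override
    public String getDefaultExtension() {
        return ".gz_localhost";
    }
}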
Compile it with javac and package NewGzipCodec.class into NewGzipCodec.jar.
Upload NewGzipCodec.jar to ${HADOOP_HOME}/lib.
Set up your core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>NewGzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>