Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to make hive take only specific files as input from hdfs folder

Tags:

hadoop

hive

I have the following folder structure in hdfs

  /input/data/yyyy/mm/dd/

and inside it data files, for example:

/input/data/2013/05/01/
      file_2013_05_01_01.json // file format yyyy_mm_dd_hh
      file_2013_05_01_02.json // file format yyyy_mm_dd_hh
      ....

I've defined hive external table for this folder:

CREATE EXTERNAL TABLE input_data (
    vr INT, ....
)
PARTITIONED BY (tsp STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE;

adding for each folder a partition as following:

   alter table input_data ADD PARTITION (tsp="2013-05-01") LOCATION '/input/data/2013/05/01/';

The following query will take as input all files in date 2013-05-01

select ... from input_data where tps="2013-05-01"

How can I take only files of specific hour? without changing the hdfs structure to put each hour in separate folder?

like image 252
Julias Avatar asked May 30 '13 19:05

Julias


People also ask

How do I import data into folder and hive?

LOAD DATA [LOCAL] INPATH '<The table data location>' [OVERWRITE] INTO TABLE <table_name>; Note: The LOCAL Switch specifies that the data we are loading is available in our Local File System. If the LOCAL switch is not used, the hive will consider the location as an HDFS path location.

What inputs we have to give to submit hive query?

One is INPUT__FILE__NAME , which is the input file's name for a mapper task. the other is BLOCK__OFFSET__INSIDE__FILE , which is the current global file position. For block compressed file, it is the current block's file offset, which is the current block's first byte's file offset. Since Hive 0.8.


2 Answers

You could make use of a virtual column called INPUT__FILE__NAME. It is one of the 2 two virtual columns provided by Hive 0.8.0 and onward and represents the input file's name for a mapper task. So you could do something like this :

select ... from input_data 
where tps="2013-05-01" 
and INPUT__FILE__NAME='file_2013_05_01_01.json';

HTH

like image 82
Tariq Avatar answered Sep 25 '22 11:09

Tariq


You could make use of the following construct:

SELECT 
   *
FROM
   my_input_data
WHERE
   INPUT__FILE__NAME LIKE '%hh.json';

Here hh is your desired hour and INPUT__FILE__NAME is the virtual column available to hive queries while processing a given file.

like image 23
Amar Avatar answered Sep 23 '22 11:09

Amar