
Using FileFormat vs SerDe to read custom text files

Tags:

hive

Hadoop/Hive newbie here. I am trying to use data stored in a custom text-based format with Hive. My understanding is you can either write a custom FileFormat or a custom SerDe class to do that. Is that the case or am I misunderstanding it? And what are some general guidelines on which option to choose when? Thanks!

asked Oct 12 '11 by radimd

3 Answers

I figured it out. It turned out I did not have to write a SerDe after all; instead I wrote a custom InputFormat (extending org.apache.hadoop.mapred.TextInputFormat) that returns a custom RecordReader (implementing org.apache.hadoop.mapred.RecordReader&lt;K, V&gt;). The RecordReader contains the logic to read and parse my files and returns tab-delimited rows.
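The heart of such a RecordReader is the step that converts one raw record into the tab-delimited row Hive expects. A minimal, self-contained sketch of just that conversion (the '|' field separator and the class name are hypothetical assumptions for illustration; in the real code this logic lives inside the RecordReader's next(key, value) method, wrapped by the custom TextInputFormat subclass):

```java
// Hypothetical sketch of the parsing step a custom RecordReader would apply
// to each raw record before handing Hive a tab-delimited row.
public class CustomRecordParser {

    // Assumed custom format: fields separated by '|' (an assumption for
    // illustration only). The real RecordReader.next(key, value) would call
    // something like this and set the value Text to the returned string.
    public static String toTabDelimitedRow(String rawRecord) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = rawRecord.split("\\|", -1);
        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        System.out.println(toTabDelimitedRow("alice|42|3.14"));
    }
}
```

Because the RecordReader already emits plain tab-delimited text, the table can use the default delimited row format, which is why no custom SerDe is needed.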

With that I declared my table as

create table t2 ( 
field1 string, 
..
fieldNN float)        
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'    
STORED AS INPUTFORMAT 'namespace.CustomFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

This uses the native SerDe. Also, Hive requires you to specify an output format whenever you specify a custom input format, so I chose one of the built-in output formats.

answered Jun 29 '23 by radimd


Basically you need to understand the difference between when to write a custom SerDe and when to write a custom FileFormat.

From official documentation: Hive SerDe

What is a SerDe?

1. SerDe is a short name for "Serializer and Deserializer."
2. Hive uses SerDe (and FileFormat) to read and write table rows.
3. HDFS files --> InputFileFormat --> &lt;key, value&gt; --> Deserializer --> Row object
4. Row object --> Serializer --> &lt;key, value&gt; --> OutputFileFormat --> HDFS files

So the 3rd and 4th points clearly show the difference. You need a custom FileFormat (input/output) when you want to read records in a different way than usual (where records are separated by '\n'). You need a custom SerDe when you want to interpret the read records in a custom way.
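In DDL terms, a custom SerDe plugs in via ROW FORMAT SERDE while the file format stays the default, the mirror image of the accepted answer's table. A sketch, where `namespace.CustomSerDe` is a hypothetical class name:

```sql
CREATE TABLE t3 (
  field1 STRING,
  field2 FLOAT
)
ROW FORMAT SERDE 'namespace.CustomSerDe'
STORED AS TEXTFILE;  -- default input/output formats still split records on '\n'
```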

Let's take an example of commonly used format JSON.

Scenario 1: Let's say you have an input JSON file where one line contains one JSON record. Now you just need a custom SerDe to interpret each read record the way you want. There is no need for a custom input format, because one line is one record.
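For this one-record-per-line case, Hive ships a JSON SerDe in the hive-hcatalog-core jar (class name as of recent Hive versions; verify it against your distribution). A sketch:

```sql
-- One JSON document per line: the default TEXTFILE record reading is fine;
-- only the interpretation of each line needs a SerDe.
CREATE TABLE events (
  id BIGINT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```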

Scenario 2: Now if you have an input file where a single JSON record spans multiple lines and you want to read it as one record, you should first write a custom input format that reads in one whole JSON record, and then that record will go to your custom SerDe.
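The multi-line case combines both pieces in the DDL. The input format and SerDe class names below are hypothetical placeholders; the output format is the built-in one used in the accepted answer:

```sql
-- Multi-line JSON records: a custom InputFormat reassembles each record,
-- then a custom SerDe interprets it.
CREATE TABLE events_multiline (
  id BIGINT,
  name STRING
)
ROW FORMAT SERDE 'namespace.CustomJsonSerDe'
STORED AS INPUTFORMAT 'namespace.MultiLineJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```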

answered Jun 29 '23 by Harry Kumar


It depends on what you're getting from your text file.

You can write a custom record reader to parse the text log file and return records the way you want; the input format class does that job for you. You then use this jar to create the Hive table and load the data into it.

Talking about SerDe, I use it a little differently. I use both an InputFormat and a SerDe: the former to parse the actual data, the latter to keep my metadata aligned with the data it represents. Why do I do that? I want the Hive table to have exactly the right columns (not more, not fewer) for each row of my log file, and I think a SerDe is the perfect solution for that.

Eventually I map those two to create a final table if I want, or keep the tables as they are so I can join them in queries.

I like the explanation in this Cloudera blog post:

http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

answered Jun 29 '23 by Raviteja Chirala