How to parse CustomWritable from text in Hadoop

Say I have timestamped values for specific users in text files, like

#userid; unix-timestamp; value
1; 2010-01-01 00:00:00; 10
2; 2010-01-01 00:00:00; 20
1; 2010-01-01 01:00:00; 11
2; 2010-01-01 01:00:00; 21
1; 2010-01-02 00:00:00; 12
2; 2010-01-02 00:00:00; 22

I have a custom class "SessionSummary" implementing readFields and write of WritableComparable. Its purpose is to sum up all values per user for each calendar day.
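For context, such a class might look roughly like the following. This is a minimal sketch only: the field names and binary layout are assumptions, since the actual class is not shown, and the WritableComparable interface is omitted so the serialization logic stands on its own.

```java
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch -- the real class implements
// WritableComparable<SessionSummary>; field names are guesses.
class SessionSummary {
    private int userId;
    private String day;   // calendar day, e.g. "2010-01-01"
    private long total;   // summed value for that user and day

    SessionSummary() {}   // no-arg constructor needed by Hadoop's reflection

    SessionSummary(int userId, String day, long total) {
        this.userId = userId;
        this.day = day;
        this.total = total;
    }

    // Binary layout the framework writes (shuffle, SequenceFiles, ...)
    public void write(DataOutput out) throws IOException {
        out.writeInt(userId);
        out.writeUTF(day);
        out.writeLong(total);
    }

    // Must read back exactly the layout write() produced, which is why
    // feeding it the tab-separated toString() text cannot work.
    public void readFields(DataInput in) throws IOException {
        userId = in.readInt();
        day = in.readUTF();
        total = in.readLong();
    }

    @Override
    public String toString() {   // used by TextOutputFormat
        return userId + "\t" + day + "\t" + total;
    }
}
```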

So the mapper maps the lines to each user, the reducer sums all values per day per user and outputs a SessionSummary via TextOutputFormat (using SessionSummary's toString, as tab-separated UTF-8 strings):

1; 2010-01-01; 21
2; 2010-01-01; 41
1; 2010-01-02; 12
2; 2010-01-02; 22

If I need to use these summary entries for a second Map/Reduce stage, how should I parse this summary data to populate the members? Can I reuse the existing readFields and write methods (of the WritableComparable interface implementation) by using the text String as DataInput somehow? This (obviously) did not work:

public void map(...) {
    SessionSummary ssw = new SessionSummary();
    ssw.readFields(new DataInputStream(new ByteArrayInputStream(value.getBytes("UTF-8"))));
}

In general: Is there a best practice to implement custom keys and values in Hadoop and make them easily reusable across several M/R stages, while keeping human-readable text output at every stage?

(Hadoop version is 0.20.2 / CDH3u3)

asked Mar 15 '12 by thomers

1 Answer

The output format for your first MR job should be SequenceFileOutputFormat. This stores the reducer's Key/Value output in a binary format that your second MR job can then read back in using SequenceFileInputFormat. Also make sure you set the outputKeyClass and outputValueClass on the Job accordingly.
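In code, that wiring might look roughly like this. It is a sketch against the new mapreduce API available in 0.20.2; the job names, paths, and the choice of IntWritable as key class are assumptions, not taken from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

Configuration conf = new Configuration();

// First job: emit binary SequenceFiles instead of text
Job job1 = new Job(conf, "daily-summary");
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(IntWritable.class);       // e.g. the user id
job1.setOutputValueClass(SessionSummary.class);  // your custom Writable
SequenceFileOutputFormat.setOutputPath(job1, new Path("summary-out"));

// Second job: read the same Key/Value pairs back, no parsing needed
Job job2 = new Job(conf, "second-stage");
job2.setInputFormatClass(SequenceFileInputFormat.class);
SequenceFileInputFormat.addInputPath(job2, new Path("summary-out"));
```

Because the SequenceFile records the serialized Writables, the second job's map method receives fully populated SessionSummary objects directly.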

The mapper in the second job then receives SessionSummary (and whatever the value type is) directly as its input types, with no text parsing needed.

If you need to see the textual output from the first MR job, you can run the following on the output files in HDFS:

hadoop fs -libjars my-lib.jar -text output-dir/part-r-*

This reads in the sequence file Key/Value pairs and calls toString() on both objects, separating them with a tab when writing to stdout. The -libjars option tells hadoop where to find your custom Key/Value classes.

answered Oct 26 '22 by Chris White