 

Advantages of using NullWritable in Hadoop

What are the advantages of using NullWritable for null keys/values over using null texts (i.e. new Text(null))? I see the following in "Hadoop: The Definitive Guide":

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don't need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().

I do not clearly understand how the output is written out when using NullWritable. Will there be a single constant value at the beginning of the output file indicating that the keys or values of this file are null, so that the MapReduce framework can skip reading the null keys/values (whichever is null)? Also, how are null texts actually serialized?

Thanks,

Venkat

Venk K asked Apr 24 '13


People also ask

What is NullWritable in Hadoop?

NullWritable is a special type of Writable , as it has a zero-length serialization. No bytes are written to, or read from, the stream.

What does the term Writable signify in Hadoop and MapReduce?

Writable data types are meant for writing data to the local disk, and they are a serialization format. Just as Java has data types to store variables (int, float, long, double, etc.), Hadoop has its own equivalent data types, called Writable data types.
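To make the Writable contract concrete, here is a sketch of the pattern: a hypothetical IntPairWritable (not a real Hadoop class) that serializes itself with write(DataOutput) and repopulates itself with readFields(DataInput), exactly the two methods Hadoop's Writable interface requires. It uses only java.io, so it runs without a Hadoop jar on the classpath.

```java
import java.io.*;

// Hypothetical illustration of the Writable pattern (not part of Hadoop):
// a type serializes itself via write(DataOutput) and
// repopulates an existing instance via readFields(DataInput).
public class IntPairWritable {
    private int first;
    private int second;

    public IntPairWritable() {}                       // no-arg constructor, as Hadoop requires
    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    public int getFirst()  { return first; }
    public int getSecond() { return second; }

    public static void main(String[] args) throws IOException {
        // Round-trip through a byte stream, the way the framework would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new IntPairWritable(3, 7).write(new DataOutputStream(bytes));
        // Two ints -> exactly 8 bytes on the wire.
        System.out.println(bytes.toByteArray().length); // prints "8"

        IntPairWritable copy = new IntPairWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getFirst() + "," + copy.getSecond()); // prints "3,7"
    }
}
```

Note that readFields fills in an existing object rather than constructing a new one; that is what lets Hadoop reuse a single instance across millions of records.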

What is LongWritable in Hadoop?

Hadoop needs to be able to serialise data in and out of Java types via DataInput and DataOutput objects (usually IO streams). The Writable classes do this by implementing two methods: write(DataOutput) and readFields(DataInput). Specifically, LongWritable is a Writable class that wraps a Java long.
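LongWritable's write method boils down to DataOutput.writeLong, a fixed 8-byte big-endian encoding. The sketch below mimics that round trip with plain java.io (no Hadoop dependency), under the assumption that this mirrors what LongWritable.write/readFields do internally:

```java
import java.io.*;

public class LongWritableSketch {
    public static void main(String[] args) throws IOException {
        // What LongWritable.write(out) effectively does: out.writeLong(value),
        // i.e. a fixed 8-byte big-endian encoding of the wrapped long.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeLong(42L);
        System.out.println(bytes.toByteArray().length); // prints "8"

        // And what readFields does on the way back in: in.readLong().
        long back = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())).readLong();
        System.out.println(back); // prints "42"
    }
}
```

Contrast this with NullWritable, whose write and readFields touch the stream not at all: 8 bytes per record versus 0.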

What is context in Hadoop?

Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces for emitting output. Applications can use the Context to report progress and to set application-level status messages.


1 Answer

The key/value types must be given at runtime, so anything writing or reading NullWritables will know ahead of time that it will be dealing with that type; there is no marker or anything in the file. And technically the NullWritables are "read", it's just that "reading" a NullWritable is actually a no-op. You can see for yourself that there's nothing at all written or read:

import java.io.*;
import java.util.Arrays;
import org.apache.hadoop.io.NullWritable;

NullWritable nw = NullWritable.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
nw.write(new DataOutputStream(out));
System.out.println(Arrays.toString(out.toByteArray())); // prints "[]"

ByteArrayInputStream in = new ByteArrayInputStream(new byte[0]);
nw.readFields(new DataInputStream(in)); // works just fine

And as for your question about new Text(null), again, you can try it out:

Text text = new Text((String) null);
ByteArrayOutputStream out = new ByteArrayOutputStream();
text.write(new DataOutputStream(out)); // throws NullPointerException
System.out.println(Arrays.toString(out.toByteArray()));

Text will not work at all with a null String.
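The closest working alternative, new Text(""), is still not free: Text's wire format is a variable-length int holding the UTF-8 byte length, followed by the bytes themselves, so even an empty string costs one byte per record where NullWritable costs zero. The sketch below imitates that format in plain Java, assuming Hadoop's vint rule that values in the range -112..127 are stored as a single byte (larger lengths use a multi-byte form, omitted here):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EmptyTextSketch {
    // Simplified stand-in for Hadoop's vint encoding, valid only for
    // small values (-112..127), which are stored as the byte itself.
    static void writeSmallVInt(DataOutput out, int v) throws IOException {
        if (v < -112 || v > 127) {
            throw new IllegalArgumentException("sketch handles small values only");
        }
        out.writeByte(v);
    }

    // Imitates Text's wire format: vint byte length, then UTF-8 bytes.
    static byte[] serializeText(String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeSmallVInt(out, utf8.length);
        out.write(utf8);
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // An empty string serializes to one byte (the length 0),
        // whereas NullWritable serializes to zero bytes.
        System.out.println(Arrays.toString(serializeText("")));   // prints "[0]"
        System.out.println(serializeText("ab").length);           // prints "3"
    }
}
```

So if you genuinely have no data in the key or value position, NullWritable is both the safe choice (no NullPointerException) and the cheaper one on disk and on the wire.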

Joe K answered Oct 11 '22