Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Serialization using ArrayWritable seems to work in a funny way

I was working with ArrayWritable, at some point I needed to check how Hadoop serializes the ArrayWritable, this is what I got by setting job.setNumReduceTasks(0):

0    IntArrayWritable@10f11b8
3    IntArrayWritable@544ec1
6    IntArrayWritable@fe748f
8    IntArrayWritable@1968e23
11    IntArrayWritable@14da8f4
14    IntArrayWritable@18f6235

and this is the test mapper that I was using:

public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, IntArrayWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int red = Integer.parseInt(value.toString());
        IntWritable[] a = new IntWritable[100];

        for (int i =0;i<a.length;i++){
            a[i] = new IntWritable(red+i);
        }

        IntArrayWritable aw = new IntArrayWritable();
        aw.set(a);
        context.write(key, aw);
    }
}

IntArrayWritable is taken from the example given in the javadoc: ArrayWritable.

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}

I actually checked on the source code of Hadoop and this makes no sense to me. ArrayWritable should not serialize the class name and there is no way that an array of 100 IntWritable can be serialized using 6/7 hexadecimal values. The application actually seems to work just fine and the reducer deserializes the right values... What is happening? What am I missing?

like image 632
igon Avatar asked Oct 27 '11 16:10

igon


People also ask

What is serialization in Java?

Java - Serialization. Java provides a mechanism, called object serialization where an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object. After a serialized object has been written into a file, it can be read from the file and ...

Is it necessary to serialize data in a program?

This is necessary given the way that serialization works, but it is not obvious behaviour. This, and other things, means that using serialization can be non-intuitive, and very hard to debug. Although .NET provides a number of quick and easy ways to serialize and deserialize data, do not use them.

What happens after a serialized object is written into a file?

After a serialized object has been written into a file, it can be read from the file and deserialized that is, the type information and bytes that represent the object and its data can be used to recreate the object in memory.

Why should you not use the simple XML serialization approach?

There are a number of reasons why you should not opt for the simple approach. Here are nine important ones. 1. It forces you to design your classes a certain way XML serialization only works on public methods and fields, and on classes with public constructors. That means your classes need to be accessible to the outside world.


1 Answers

You have to override the default toString() method.

It's called by the TextOutputFormat to create a human readable format.

Try out the following code and see the result:

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (String s : super.toStrings())
        {
            sb.append(s).append(" ");
        }
        return sb.toString();
    }
}
like image 135
Le Duc Duy Avatar answered Oct 05 '22 22:10

Le Duc Duy