Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hadoop Matrix Multiplication

I was running the MapReduce Matrix Multiplication program found at http://www.norstad.org/matrix-multiply/index.html. I found out that this implementation doesn't work properly when there are 0's in the input matrices. But I don't understand why, and how do I modify the program to make it work? The MapReduce operation completes, but the output is always a matrix with all elements 0.

My input Matrices A & B are:

Matrix A     Matrix B
0 0 0        6 7 4 
0 1 6        9 1 3 
7 8 9        7 6 2  

Output Matrix:

0 0 0
0 0 0
0 0 0

The following are taken from the log files for the job:

Map Output for matrixB:

##### Map setup: matrixA = false for hdfs://localhost/user/hadoop-user/B
strategy = 4
R1 = 4
I = 3
K = 3
J = 3
IB = 3
KB = 3
JB = 3
##### Map input: (0,0) 6
##### Map output: (0,0,0,1) (0,0,6) 
##### Map input: (0,1) 7
##### Map output: (0,0,0,1) (0,1,7) 
##### Map input: (0,2) 4
##### Map output: (0,0,0,1) (0,2,4) 
##### Map input: (1,0) 9
##### Map output: (0,0,0,1) (1,0,9) 
##### Map input: (1,1) 1
##### Map output: (0,0,0,1) (1,1,1) 
##### Map input: (1,2) 3
##### Map output: (0,0,0,1) (1,2,3) 
##### Map input: (2,0) 7
##### Map output: (0,0,0,1) (2,0,7) 
##### Map input: (2,1) 6
##### Map output: (0,0,0,1) (2,1,6) 
##### Map input: (2,2) 2
##### Map output: (0,0,0,1) (2,2,2) 

Map output for Matrix A:

##### Map setup: matrixA = true for hdfs://localhost/user/hadoop-user/A
strategy = 4
R1 = 4
I = 3
K = 3
J = 3
IB = 3
KB = 3
JB = 3
##### Map input: (1,1) 1
##### Map output: (0,0,0,0) (1,1,1) 
##### Map input: (1,2) 6
##### Map output: (0,0,0,0) (1,2,6) 
##### Map input: (2,0) 7
##### Map output: (0,0,0,0) (2,0,7) 
##### Map input: (2,1) 8
##### Map output: (0,0,0,0) (2,1,8) 
##### Map input: (2,2) 9
##### Map output: (0,0,0,0) (2,2,9) 

Notice that the Map for matrix A does not read the zeros from the input file. Note: both input files are stored in HDFS as sequence files, and the output is a sequence file too. Can someone shed some light on what the problem is? Thanks!

The code for the Mapper class is given below:

/** The job 1 mapper class. */

private static class Job1Mapper 
    extends Mapper<IndexPair, IntWritable, Key, Value>
{
    private Path path;
    private boolean matrixA;
    private Key key = new Key();
    private Value value = new Value();

    public void setup (Context context) {
        init(context);
        FileSplit split = (FileSplit)context.getInputSplit();
        path = split.getPath();
        matrixA = path.toString().startsWith(inputPathA);
        if (DEBUG) {
            System.out.println("##### Map setup: matrixA = " + matrixA + " for " + path);
            System.out.println("   strategy = " + strategy);
            System.out.println("   R1 = " + R1);
            System.out.println("   I = " + I);
            System.out.println("   K = " + K);
            System.out.println("   J = " + J);
            System.out.println("   IB = " + IB);
            System.out.println("   KB = " + KB);
            System.out.println("   JB = " + JB);
        }
    }

    private void printMapInput (IndexPair indexPair, IntWritable el) {
        System.out.println("##### Map input: (" + indexPair.index1 + "," + 
            indexPair.index2 + ") " + el.get());
    }

    private void printMapOutput (Key key, Value value) {
        System.out.println("##### Map output: (" + key.index1 + "," + 
            key.index2 + "," + key.index3 + "," + key.m + ") (" + 
            value.index1 + "," + value.index2 + "," + value.v + ") ");
    }

    private void badIndex (int index, int dim, String msg) {
        System.err.println("Invalid " + msg + " in " + path + ": " + index + " " + dim);
        System.exit(1);
    }

    public void map (IndexPair indexPair, IntWritable el, Context context)
        throws IOException, InterruptedException 
    {
        if (DEBUG) printMapInput(indexPair, el);
        int i = 0;
        int k = 0;
        int j = 0;
        if (matrixA) {
            i = indexPair.index1;
            if (i < 0 || i >= I) badIndex(i, I, "A row index");
            k = indexPair.index2;
            if (k < 0 || k >= K) badIndex(k, K, "A column index");
        } else {
            k = indexPair.index1;
            if (k < 0 || k >= K) badIndex(k, K, "B row index");
            j = indexPair.index2;
            if (j < 0 || j >= J) badIndex(j, J, "B column index");
        }
        value.v = el.get();
                if (matrixA) {
                    key.index1 = i/IB;
                    key.index3 = k/KB;
                    key.m = 0;
                    value.index1 = i % IB;
                    value.index2 = k % KB;
                    for (int jb = 0; jb < NJB; jb++) {
                        key.index2 = jb;
                        context.write(key, value);
                        if (DEBUG) printMapOutput(key, value);
                    }
                } else {
                    key.index2 = j/JB;
                    key.index3 = k/KB;
                    key.m = 1;
                    value.index1 = k % KB;
                    value.index2 = j % JB;
                    for (int ib = 0; ib < NIB; ib++) {
                        key.index1 = ib;
                        context.write(key, value);
                        if (DEBUG) printMapOutput(key, value);
                    }
        }
    }
}

If there are any syntax errors, it's only because I copy-pasted the code and modified it here, so I might have overlooked something.

I need help with the logic of the Map function, which does not read 0's from the input file, can anyone tell me why?

like image 654
Chaos Avatar asked Oct 23 '22 05:10

Chaos


1 Answers

In TestMatrixMultiply.java, from the web site you linked, which presumably contains the code you're using to encode your matrices into the expected IndexPair-backed file format, there is the function writeMatrix:

public static void writeMatrix (int[][] matrix, int rowDim, int colDim, String pathStr)
    throws IOException
{
    Path path = new Path(pathStr);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, 
        MatrixMultiply.IndexPair.class, IntWritable.class, 
        SequenceFile.CompressionType.NONE);
    MatrixMultiply.IndexPair indexPair = new MatrixMultiply.IndexPair();
    IntWritable el = new IntWritable();
    for (int i = 0; i < rowDim; i++) {
        for (int j = 0; j < colDim; j++) {
            int v = matrix[i][j];
            if (v != 0) { // !!! well, that would be why we aren't writing 0s
                indexPair.index1 = i;
                indexPair.index2 = j;
                el.set(v);
                writer.append(indexPair, el);
            }
        }
    }
    writer.close();
}

Comment inserted in the second line of the inner for loop.

Your mapper is reading no 0s because your input files contain no 0s.

The code is heavily designed to assume that all missing values are 0, and it performs additional checks to avoid emitting 0s, to attempt to minimize network traffic.

Everything below here is mostly wrong because I misunderstood the algorithm
(the part above is still useful, though)

From the linked page, you're using strategy 3. Strategy 3 relies on partitioner behavior and sort order. Unfortunately, the partitioner is wrong! This is separate from the 0s not being printed out. The partitioner is just straight-up wrong here, and you're getting matrices full of 0s because it's multiplying by data that was previously initialized to 0 and never overwritten with the correct data for the block. This is hidden on 1-machine operation, because the partitioner is a null operation, but breaks in most clusters.

The partitioner maps intermediate key (kb, jb, ib) to a reducer r as follows:

r = (jb*KB + kb) mod R

It needs to guarantee that all keys for the same block are assigned to the same reducer. Unfortunately, it guarantees that this will not happen unless KB % numReducers == 0:

map (key, value)
   if from matrix A with key=(i,k) and value=a(i,k)
      for 0 <= jb < NJB
         emit (k/KB, jb, i/IB), (i mod IB, k mod KB, a(k,j)) // compare this...
   if from matrix B with key=(k,j) and value=b(k,j)
       emit (k/KB, j/JB, -1), (k mod KB, j mod KB, b(k,j))  // ...to this

For matrix A, key jb is being iterated. For matrix B, key jb is being calculated. Since assignment to a reducer is round-robin, this guarantees that the A-matrix keys will not be assigned to the same reducer as the B-matrix keys. Therefore, the algorithm fails, since it assumes something about assignment and ordering of keys that is provably incorrect. It's correct about key order if and only if there is one reducer to which all keys are assigned, but the partitioner is wrong!

The partitioner must be modified to use kb % numReducers for Strategy 3. This is not a very good partitioner, but it is the only partitioner that will work because non-identical keys are required to be sorted to the same reducer in a particular order.

The code that you have actually put inside the question is unrelated to where the bug actually exists.

like image 98
Adam Norberg Avatar answered Oct 27 '22 10:10

Adam Norberg