
How to get block cyclic distribution?

Tags: c, io, matrix, block, mpi

I am trying to distribute my matrix in a block-cyclic fashion. I learned a lot from this question (MPI IO Reading and Writing Block Cyclic Matrix), but that is not quite what I need.

Let me explain my problem.

Suppose I have this matrix of dimension 12 x 12, which I want to distribute over a processor grid of dimension 2 x 3 such that the first processor gets the bolded elements:

A =

     1     2     3     4     5     6     7     8     9    10    11    12
    13    14    15    16    17    18    19    20    21    22    23    24
    25    26    27    28    29    30    31    32    33    34    35    36
    37    38    39    40    41    42    43    44    45    46    47    48
    49    50    51    52    53    54    55    56    57    58    59    60
    61    62    63    64    65    66    67    68    69    70    71    72
    73    74    75    76    77    78    79    80    81    82    83    84
    85    86    87    88    89    90    91    92    93    94    95    96
    97    98    99   100   101   102   103   104   105   106   107   108
   109   110   111   112   113   114   115   116   117   118   119   120
   121   122   123   124   125   126   127   128   129   130   131   132
   133   134   135   136   137   138   139   140   141   142   143   144

So, basically, I want to partition my matrix into blocks of dimension 2 x 2, and then distribute those blocks to the processors (numbered from 1 to 6) in this way:

1 2 3 1 2 3
4 5 6 4 5 6
1 2 3 1 2 3
4 5 6 4 5 6

I tried to achieve that as in the question linked above, but the problem is that the local array for the first processor is formed column-wise, i.e. it looks like this:

1, 13, 49, 61, 97, 109, 2, 14, 50, 62, 98, 110, 7, 19, 55, 67, 103, 115, 8, 20, 56, 68, 104, 116

This is my C code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

#define     N           12
#define     P           2
#define     Q           3

int main(int argc, char **argv) {
    int rank;
    int size;

    double *A;
    int A_size;
    
    MPI_Datatype filetype;
    MPI_File fin;

    MPI_Status status;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /**
     * Reading from file.
     */
    int gsizes[2], distribs[2], dargs[2], psizes[2];

    gsizes[0] = N; /* no. of rows in global array */
    gsizes[1] = N; /* no. of columns in global array*/

    distribs[0] = MPI_DISTRIBUTE_CYCLIC;
    distribs[1] = MPI_DISTRIBUTE_CYCLIC;

    dargs[0] = 2; // no of rows in block
    dargs[1] = 2; // no of cols in block

    psizes[0] = P; /* no. of processes in vertical dimension
     of process grid */
    psizes[1] = Q; /* no. of processes in horizontal dimension
     of process grid */

    MPI_Type_create_darray(P * Q, rank, 2, gsizes, distribs, dargs, psizes,
            MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "A.txt",
            MPI_MODE_RDONLY,
            MPI_INFO_NULL, &fin);

    MPI_File_set_view(fin, 0, MPI_DOUBLE, filetype, "native",
            MPI_INFO_NULL);

    A_size = (N * N) / (P * Q);
    A = (double*) malloc(A_size * sizeof(double));
    MPI_File_read_all(fin, A, A_size,
            MPI_DOUBLE, &status);

    MPI_File_close(&fin);

    printf("\n======\ni = %d\n", rank);
    printf("A : ");
    for (int i = 0; i < A_size; i++) {
        printf("%lg ", A[i]);
    }

    MPI_Finalize();
    return 0;
}

What I really want is for those 2 x 2 blocks to be stored consecutively, i.e. for the local array of the first processor to look like this:

1, 13, 2, 14, 49, 61, 50, 62, 97, 109, 98, 110, ... 

I assume that I will need to define another MPI_Datatype (like a vector or subarray), but I just cannot figure out how to achieve that.

Edit

I think I have partially solved my problem. Basically, each processor ends up with a 6 x 4 local matrix stored in FORTRAN (column-major) order, and with MPI_Type_create_subarray(...) I can then easily extract a 2 x 2 block.
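Roughly, the extraction looks like this (a sketch only, assuming the local piece is viewed as a 6 x 4 array in Fortran order; block_row and block_col are placeholder indices of the wanted block):

int lsizes[2]   = {6, 4};          /* local rows, local cols (12/2 and 12/3) */
int subsizes[2] = {2, 2};          /* one 2 x 2 block                        */
int starts[2]   = {2 * block_row,  /* element offsets of that block          */
                   2 * block_col};
MPI_Datatype block_t;

MPI_Type_create_subarray(2, lsizes, subsizes, starts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &block_t);
MPI_Type_commit(&block_t);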

But now I want each processor to send its block-row to every processor in the same column, and vice versa. The processors are numbered in the grid

1 2 3
4 5 6

so, for example, in the first step, processor 1 should send its block-row

1  2  7  8
13 14 19 20

to processor 4; and its block-column

1   2
13  14
49  50
61  62
97  98
109 110

to processors 2 and 3.

I created a Cartesian communicator and used MPI_Cart_sub() to create row-wise and column-wise communicators as well.
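The setup looks roughly like this (a minimal sketch, assuming a non-periodic P x Q = 2 x 3 grid):

MPI_Comm cart_comm, row_comm, col_comm;
int dims[2]    = {P, Q};
int periods[2] = {0, 0};
int remain[2];

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);

/* keep only dimension 1: all processes in my grid row */
remain[0] = 0; remain[1] = 1;
MPI_Cart_sub(cart_comm, remain, &row_comm);

/* keep only dimension 0: all processes in my grid column */
remain[0] = 1; remain[1] = 0;
MPI_Cart_sub(cart_comm, remain, &col_comm);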

I think I should use MPI_Bcast(), but I do not know how to combine MPI_Bcast() with MPI_Type_create_subarray(). I could first copy the extracted subarray into some local_array and then Bcast(local_array). However, MPI_Type_create_subarray() only gives me a "view" of the subarray, not the data itself, so the best solution I have come up with is an Isend/Irecv pair from root to root.

Is there a more elegant solution?

1 Answer

When you use MPI_Type_create_darray(..., MPI_ORDER_FORTRAN, ...), MPI lays the data out in Fortran (column-major) order. If you then print the local buffer from C, it looks "column-first" or scrambled compared to the row-major layout a C program expects.

If you really want the local blocks laid out in row-major order in your local memory, just tell MPI that you want C ordering.

Use MPI_ORDER_C in MPI_Type_create_darray(...) to get a row-major local array.

For example:

MPI_Type_create_darray(
        P * Q,         // total number of processes
        rank,          // this rank
        2,             // number of dimensions
        gsizes,        // global sizes [N, N]
        distribs,      // [MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC]
        dargs,         // [2, 2] block sizes
        psizes,        // [P, Q]
        MPI_ORDER_C,   // <<-- use C ordering
        MPI_DOUBLE,
        &filetype);

This produces a local array laid out row-major, consistent with typical "C-style" arrays.

To summarize:

Use MPI_ORDER_C when creating your darray if you want each rank’s local data in a typical “row-major” layout.

For broadcasting or sending sub-blocks, you have two major approaches:

(A) Copy the sub-block into a temporary contiguous buffer and use MPI_Bcast with a plain datatype (e.g. MPI_DOUBLE). This is often the simplest to code at first.

(B) Define a subarray (or vector) derived datatype that corresponds exactly to the sub-block location and size in your local 2D array. Then do MPI_Bcast(..., 1, sub_block_type, ...).

Either way works; a sketch combining the two is given below.
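For concreteness, here is a rough sketch of the block-row broadcast from the question, under a few assumptions (none of them come from the original code): the local piece is a 6 x 4 Fortran-ordered array A as in the question's edit, col_comm is the column communicator built with MPI_Cart_sub, and root is the rank inside col_comm that owns the block-row. The root broadcasts straight out of A through the subarray "view" (approach B), while the other ranks receive the same 8 doubles into a contiguous buffer (approach A); the type signatures match, which is all MPI_Bcast requires.

int lsizes[2]   = {6, 4};    /* assumed local rows, local cols         */
int subsizes[2] = {2, 4};    /* one block-row: 2 rows, all local cols  */
int starts[2]   = {0, 0};    /* placeholder: the first block-row       */
MPI_Datatype blockrow_t;
double rowbuf[2 * 4];        /* contiguous landing buffer on receivers */
int col_rank;

MPI_Type_create_subarray(2, lsizes, subsizes, starts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &blockrow_t);
MPI_Type_commit(&blockrow_t);

MPI_Comm_rank(col_comm, &col_rank);
if (col_rank == root)
    /* send directly from the local array through the subarray view */
    MPI_Bcast(A, 1, blockrow_t, root, col_comm);
else
    /* receive the 8 values contiguously, in the root's column-major order */
    MPI_Bcast(rowbuf, 2 * 4, MPI_DOUBLE, root, col_comm);

MPI_Type_free(&blockrow_t);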
