 

MPI IO Reading and Writing Block Cyclic Matrix

Tags: c, io, mpi

I have a school project to do matrix multiplication on an HPC distributed system.

I need to read in a matrix from a parallel I/O system and use PBLACS to perform the matrix multiplication in parallel on many compute nodes (processors). The data must be read in using MPI-IO commands. I know PBLACS uses block-cyclic distributions to perform the multiplication.

The professor has not given us much info on MPI-IO, and I am having trouble finding much information or resources on it. Specifically, is there a way to read in a matrix from a parallel I/O system in a block-cyclic manner and easily plug it into PBLACS pdgemm?

Any pointers to useful resources would be much appreciated. I am a bit short on time, and getting frustrated with the lack of direction on this project.

asked Apr 26 '12 by James Cotter
1 Answer

This is actually relatively straightforward to do (if you already know something about blacs/scalapack and mpi-io!), but even then the documentation, even the online material, is somewhat poor, as you've discovered.

The first thing to know about MPI-IO is that it lets you use normal MPI data types to specify each process's "view" of the file, and then read only the data that falls into that view. At our centre we have slides and source code for a half-day course on parallel I/O; the first third or so is about the basics of MPI-IO. There are slides here and sample source code here.
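To make the "view" idea concrete, here is a minimal C sketch. The file name vector.bin and the simple split into equal contiguous chunks are made up for illustration; the real program further down uses a block-cyclic view instead.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Toy layout: a file of ntotal doubles split into equal contiguous
     * chunks, one per rank (assume it divides evenly). */
    const int ntotal = 1024;
    int chunk = ntotal / nprocs;
    double *buf = malloc(chunk * sizeof(double));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "vector.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* The "view": this rank sees the file starting at its own byte offset;
     * everything outside the view is invisible to it. */
    MPI_Offset disp = (MPI_Offset)rank * chunk * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    /* Collective read: each rank gets only the data in its view. */
    MPI_File_read_all(fh, buf, chunk, MPI_DOUBLE, MPI_STATUS_IGNORE);
    printf("rank %d read %d doubles starting at byte %lld\n",
           rank, chunk, (long long)disp);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}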

The second thing to know is that MPI has a built-in way to create "distributed array" data types, one combination of which lets you lay out a block-cyclic data distribution; that's discussed in general terms in my answer to this question: What is the difference between darray and subarray in MPI?
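As a sketch of just that datatype step in C (the matrix size, block size, and float element type are illustrative values; the real part is the MPI_Type_create_darray call itself):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Illustrative sizes: an N x N matrix of floats in nb x nb blocks. */
    const int N = 1000, nb = 64;

    int gsizes[2]   = {N, N};
    int distribs[2] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
    int dargs[2]    = {nb, nb};

    /* Let MPI pick a near-square process grid, as the Fortran code below does. */
    int psizes[2] = {0, 0};
    MPI_Dims_create(nprocs, 2, psizes);

    /* The darray type picks out exactly the elements of the global
     * (column-major, hence MPI_ORDER_FORTRAN) matrix that land on this
     * rank under a 2-d block-cyclic distribution. */
    MPI_Datatype darray;
    MPI_Type_create_darray(nprocs, rank, 2, gsizes, distribs, dargs, psizes,
                           MPI_ORDER_FORTRAN, MPI_FLOAT, &darray);
    MPI_Type_commit(&darray);

    /* Its size in bytes tells you how big the local buffer has to be. */
    int locsize;
    MPI_Type_size(darray, &locsize);
    printf("rank %d owns %d local elements\n", rank, locsize / (int)sizeof(float));

    MPI_Type_free(&darray);
    MPI_Finalize();
    return 0;
}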

So that means if you have a binary file containing a big matrix, you can read it in with MPI-IO using MPI_Type_create_darray and it'll be distributed across tasks in a block-cyclic way. Then it's just a matter of making the blacs or scalapack calls. An example program using the scalapack psgemv for matrix-vector multiplication (rather than psgemm) is listed in my answer to a question on the Computational Science Stack Exchange.
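The reason that "just plugs in" is that the buffer such a darray read fills is laid out exactly the way scalapack expects the local piece of the matrix: a column-major myArows x myAcols array (so the local leading dimension is myArows), with those sizes given by numroc. If it helps to see the block-cyclic counting spelled out, here is a standalone C transcription of numroc's arithmetic, for illustration only; the real routine ships with blacs/scalapack and is what the Fortran program below calls.

#include <stdio.h>

/* How many rows (or columns) of a dimension of length n, dealt out in
 * nb-sized blocks round-robin over nprocs processes starting at process
 * isrcproc, land on process iproc.  Same counting as ScaLAPACK's NUMROC. */
int numroc(int n, int nb, int iproc, int isrcproc, int nprocs) {
    int mydist   = (nprocs + iproc - isrcproc) % nprocs;
    int nblocks  = n / nb;                    /* complete blocks overall     */
    int nloc     = (nblocks / nprocs) * nb;   /* whole rounds of the deal    */
    int extrablk = nblocks % nprocs;
    if (mydist < extrablk)
        nloc += nb;                           /* one extra full block        */
    else if (mydist == extrablk)
        nloc += n % nb;                       /* the trailing partial block  */
    return nloc;
}

int main(void) {
    /* Example: a 1000 x 1000 matrix in 64 x 64 blocks on a 2 x 3 grid. */
    int n = 1000, nb = 64, prow = 2, pcol = 3;
    for (int myrow = 0; myrow < prow; myrow++)
        for (int mycol = 0; mycol < pcol; mycol++)
            printf("grid (%d,%d): local piece is %d x %d\n",
                   myrow, mycol,
                   numroc(n, nb, myrow, 0, prow),
                   numroc(n, nb, mycol, 0, pcol));
    return 0;
}

Those are the same myArows/myAcols values the Fortran program below gets back from numroc before its descinit calls.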

To give you an idea of how the pieces fit together, the following is a simple program which reads in a binary file containing a matrix (first the size of the square matrix N, then the N^2 elements) and then calculates the eigenvalues and eigenvectors using scalapack's (new) pssyevr routine. It combines the MPI-IO, darray, and scalapack stuff. It's in Fortran, but the function calls are the same in C-based languages.

!
! Use MPI-IO to read a diagonal matrix distributed block-cyclically,
! use Scalapack to calculate its eigenvalues, and compare
! to expected results.
!
program darray
      use mpi
      implicit none

      integer :: n, nb    ! problem size and block size
      integer :: myArows, myAcols   ! size of local subset of global array
      real :: p
      real, dimension(:), allocatable :: myA, myZ
      real, dimension(:), allocatable :: work
      integer, dimension(:), allocatable :: iwork
      real, dimension(:), allocatable :: eigenvals
      real, dimension(:), allocatable :: expected
      integer :: worksize, totwork, iworksize

      integer, external :: numroc   ! blacs routine
      integer :: me, procs, icontxt, prow, pcol, myrow, mycol  ! blacs data
      integer :: info    ! scalapack return value
      integer, dimension(9)   :: ides_a, ides_z ! scalapack array desc
      integer :: clock
      real :: calctime, iotime

      character(len=128) :: filename
      integer :: mpirank
      integer :: ierr
      integer, dimension(2) :: pdims, dims, distribs, dargs
      integer :: infile
      integer, dimension(MPI_STATUS_SIZE) :: mpistatus
      integer :: darray
      integer :: locsize, nelements
      integer(kind=MPI_ADDRESS_KIND) :: lb, locextent
      integer(kind=MPI_OFFSET_KIND) :: disp
      integer :: nargs
      integer :: m, nz

! Initialize MPI (for MPI-IO)

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD,procs,ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD,mpirank,ierr)

! May as well get the process grid from MPI_Dims_create
      pdims = 0
      call MPI_Dims_create(procs, 2, pdims, ierr)
      prow = pdims(1)
      pcol = pdims(2)

! get filename
      nargs = command_argument_count()
      if (nargs /= 1) then
          print *,'Usage: darray filename'
          print *,'       Where filename = name of binary matrix file.'
          call MPI_Abort(MPI_COMM_WORLD,1,ierr)
      endif
      call get_command_argument(1, filename)

! find the size of the array we'll be using

      call tick(clock)
      call MPI_File_open(MPI_COMM_WORLD, trim(filename), MPI_MODE_RDONLY, MPI_INFO_NULL, infile, ierr)
      call MPI_File_read_all(infile,n,1,MPI_INTEGER,mpistatus,ierr)
      call MPI_File_close(infile,ierr)

! create the darray that will read in the data.

! block size: 64, unless that is more than the per-process share of the
! matrix in either dimension of the process grid
      nb = 64
      if (nb > (N/prow)) nb = N/prow
      if (nb > (N/pcol)) nb = N/pcol
      dims = [n,n]
      distribs = [MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC]
      dargs = [nb, nb]

      call MPI_Type_create_darray(procs, mpirank, 2, dims, distribs, dargs, &
                                  pdims, MPI_ORDER_FORTRAN, MPI_REAL, darray, ierr)
      call MPI_Type_commit(darray,ierr)

      call MPI_Type_size(darray, locsize, ierr)
      nelements = locsize/4   ! local element count: type size in bytes / 4 bytes per default real
      call MPI_Type_get_extent(darray, lb, locextent, ierr)

! Initialize local arrays    

      allocate(myA(nelements))
      allocate(myZ(nelements))
      allocate(eigenvals(n), expected(n))

! read in the data
      call MPI_File_open(MPI_COMM_WORLD, trim(filename), MPI_MODE_RDONLY, MPI_INFO_NULL, infile, ierr)
      disp = 4   ! skip N = 4 bytes
      call MPI_File_set_view(infile, disp, MPI_REAL, darray, "native", MPI_INFO_NULL, ierr)
      call MPI_File_read_all(infile, myA, nelements, MPI_REAL, mpistatus, ierr)
      call MPI_File_close(infile,ierr)

      iotime = tock(clock)
      if (mpirank == 0) then
          print *,'I/O time      = ', iotime
      endif

! Initialize blacs processor grid

      call tick(clock)
      call blacs_pinfo   (me,procs)

      call blacs_get     (-1, 0, icontxt)
      call blacs_gridinit(icontxt, 'R', prow, pcol)
      call blacs_gridinfo(icontxt, prow, pcol, myrow, mycol)

      myArows = numroc(n, nb, myrow, 0, prow)
      myAcols = numroc(n, nb, mycol, 0, pcol)

! Construct local arrays
! Global structure:  matrix A of n rows and n columns

! Prepare array descriptors for ScaLAPACK 
      call descinit( ides_a, n, n, nb, nb, 0, 0, icontxt, myArows, info )
      call descinit( ides_z, n, n, nb, nb, 0, 0, icontxt, myArows, info )

! Call ScaLAPACK library routine

      allocate(work(1), iwork(1))
      iwork(1) = -1
      work(1)  = -1.
      call pssyevr( 'V', 'A', 'U', n, myA, 1, 1, ides_a, -1.e20, +1.e20, 1, n, &
                     m,  nz, eigenvals, myZ, 1, 1, ides_z, work, -1,           &
                     iwork, -1, info )
      worksize  = int(work(1))/2*3
      iworksize = iwork(1)/2*3
      print *, 'Local workspace ', worksize
      call MPI_Reduce(worksize, totwork, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (mpirank == 0) print *, ' total work space ', totwork
      call MPI_Reduce(iworksize, totwork, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (mpirank == 0) print *, ' total iwork space ', totwork
      deallocate(work,iwork)
      allocate(work(worksize),iwork(iworksize))
      call pssyevr( 'V', 'A', 'U', n, myA, 1, 1, ides_a, -1.e20, +1.e20, 1, n, &
                     m,  nz, eigenvals, myZ, 1, 1, ides_z, work, worksize,     &
                     iwork, iworksize, info )
      if (info /= 0) then
         print *, 'Error: info = ', info
      else if (mpirank == 0) then
         print *, 'Calculated ', m, ' eigenvalues and ', nz, ' eigenvectors.'
      endif

! Deallocate the local arrays

      deallocate(myA, myZ)
      deallocate(work, iwork)

! End blacs for processors that are used

      call blacs_gridexit(icontxt)
      calctime = tock(clock)

! calculate the expected eigenvalues for a particular test matrix

      p = 3. + sqrt((4. * n - 3.) * (n - 1.)*3./(n+1.))
      expected(1) = p/(n*(5.-2.*n))
      expected(2) = 6./(p*(n + 1.))
      expected(3:n) = 1.

! Print results

      if (me == 0) then
        if (info /= 0) then
             print *, 'Error -- info = ', info
        endif
        print *,'Eigenvalues L_infty err = ', &
          maxval(abs(eigenvals-expected))
        print *,'Compute time = ', calctime
      endif

      deallocate(eigenvals, expected)

      call MPI_Finalize(ierr)


contains
    subroutine tick(t)
        integer, intent(OUT) :: t

        call system_clock(t)
    end subroutine tick

    ! returns time in seconds from now to time described by t
    real function tock(t)
        integer, intent(in) :: t
        integer :: now, clock_rate

        call system_clock(now,clock_rate)

        tock = real(now - t)/real(clock_rate)
    end function tock

end program darray
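For completeness, one way to generate an input file in the layout that program expects (a 4-byte integer N followed by the N^2 default-size 4-byte reals, column-major, native byte order, no record markers) is the plain C writer below. The identity matrix it writes is only a placeholder for smoke-testing the I/O path; the expected-eigenvalue check at the end of the program assumes its own specific test matrix, which is not reproduced here.

#include <stdio.h>
#include <stdlib.h>

/* Write N, then an N x N identity matrix of 4-byte floats in column-major
 * order, as a raw byte stream (no Fortran record markers).  Assumes 4-byte
 * int and float, matching the 4-byte offsets hard-coded in the reader. */
int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s N filename\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[1]);
    FILE *f = fopen(argv[2], "wb");
    if (!f) { perror("fopen"); return 1; }

    fwrite(&n, sizeof(int), 1, f);
    for (int col = 0; col < n; col++) {
        for (int row = 0; row < n; row++) {
            float v = (row == col) ? 1.0f : 0.0f;
            fwrite(&v, sizeof(float), 1, f);
        }
    }
    fclose(f);
    return 0;
}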
answered by Jonathan Dursi