
How to load balance a simple loop using MPI in C++

Tags:

c++

mpi

intel-mpi

I am writing some code that is computationally expensive but highly parallelisable. Once parallelised, I intend to run it on an HPC; however, to keep the runtime within a week, the problem needs to scale well with the number of processors.

Below is a simple (and ludicrous) example of what I am attempting to achieve, which is concise enough to compile and demonstrates my problem:

#include <iostream>
#include <ctime>
#include "mpi.h"

using namespace std;

// Dummy integrand: expensive enough to make the timing meaningful
double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main()
{
    int n = 3500000;
    int counter = 0;
    time_t start_time = time(nullptr);
    int myid, numprocs;
    int k;
    double integrate, result;
    double end = 0.5;
    double start = -2.;
    double E;
    double factor = (end - start)/(n*1.);
    integrate = 0;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    // Cyclic (round-robin) distribution: rank r handles k = r, r+P, r+2P, ...
    for (k = myid; k < n+1; k += numprocs){
        E = start + k*(end-start)/n;
        if ((k == 0) || (k == n))   // trapezoidal rule: end points get half weight
            integrate += 0.5*factor*int_theta(E);
        else
            integrate += factor*int_theta(E);
        counter++;
    }
    cout<<"process "<<myid<<" took "<<time(nullptr)-start_time<<"s"<<endl;
    cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;
    // Sum the partial integrals onto rank 0
    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        cout<<result<<endl;
    MPI_Finalize();
    return 0;
}

I compiled the program on my quad-core laptop with

mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl

and I get the following output:

$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08

$ mpirun -np 3 ./a.out 
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08

$ mpirun -np 2 ./a.out 
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08

To me it appears that there must be a barrier somewhere that I am not aware of, since I get better performance with 2 processors than with 3. Can somebody please offer any advice? Thanks.

AlexD asked Apr 07 '19


1 Answer

If I read the output of lscpu you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you have 4 logical CPUs but only 2 hardware cores (1 socket x 2 cores per socket). Using cat /proc/cpuinfo you can find the make and model of the CPU to find out more.

The four logical CPUs likely result from hyperthreading, which means that some hardware resources (e.g. the FPU, but I am not an expert on this) are shared between the two hardware threads of each core. Thus, I would not expect good parallel scaling beyond two processes for a compute-bound workload like this one.

For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores to get a better estimate.

From looking at your code, I would expect near-perfect scalability to any number of cores, at least as long as you do not include the time needed for process startup and the final MPI_Reduce. Both of these will certainly become slower as more processes are involved.

dasmy answered Oct 20 '22