I am writing some code which is computationally expensive but highly parallelisable. Once parallelised, I intend to run it on an HPC cluster; however, to keep the runtime within a week, the problem needs to scale well with the number of processors.
Below is a simple and ludicrous example of what I am attempting to achieve, which is concise enough to compile and demonstrate my problem:
#include <iostream>
#include <ctime>
#include "mpi.h"
using namespace std;

double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main()
{
    int n = 3500000;
    int counter = 0;
    time_t timer;
    int start_time = time(&timer);
    int myid, numprocs;
    int k;
    double integrate, result;
    double end = 0.5;
    double start = -2.;
    double E;
    double factor = (end - start)/(n*1.);
    integrate = 0;

    MPI_Init(NULL,NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    for (k = myid; k<n+1; k+=numprocs){
        E = start + k*(end-start)/n;
        if (( k == 0 ) || (k == n))
            integrate += 0.5*factor*int_theta(E);
        else
            integrate += factor*int_theta(E);
        counter++;
    }

    cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
    cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;

    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        cout<<result<<endl;

    MPI_Finalize();
    return 0;
}
I have compiled the code on my quad-core laptop with
mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
and I get the following output:
$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08
$ mpirun -np 3 ./a.out
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08
$ mpirun -np 2 ./a.out
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08
To me it appears that there must be a barrier somewhere that I am not aware of, since I get better performance with 2 processes than with 3. Can somebody please offer some advice? Thanks
If I read the output of lscpu you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you have 4 logical CPUs, but only 2 hardware cores (1 socket x 2 cores per socket).
Using cat /proc/cpuinfo you can find the make and model of the CPU, which may help you find out more.
The four logical CPUs might result from hyperthreading, which means that some hardware resources (e.g. the FPU unit, but I am not an expert on this) are shared between the two hardware threads of each core. Thus, I would not expect any good parallel scaling beyond two processes.
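If you want a quick programmatic sanity check, something like the sketch below prints the number of logical CPUs the standard library sees. Note that std::thread::hardware_concurrency() counts hardware threads (so it includes hyperthreads), not physical cores, so you still need lscpu or /proc/cpuinfo to tell the two apart.

// Minimal sketch: report the number of logical CPUs (hardware threads).
// On a hyperthreaded 2-core machine this will typically print 4.
// hardware_concurrency() may return 0 if the value cannot be determined.
#include <iostream>
#include <thread>

int main()
{
    std::cout << "logical CPUs: " << std::thread::hardware_concurrency() << std::endl;
    return 0;
}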
For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores to get a better estimate.
From looking at your code, I would expect perfect scalability to any number of cores, at least as long as you do not include the time needed for process startup and the final MPI_Reduce. Those will certainly become slower as more processes are involved.
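One way to check this is to time only the compute loop with MPI_Wtime, so that process startup and the final MPI_Reduce are excluded from the per-rank measurement. The sketch below is a lightly adapted version of the code from the question (same loop and trapezoid weights, only the timing is changed); it is meant as an illustration under that assumption, not a drop-in replacement.

// Sketch adapted from the code in the question: time only the compute loop
// with MPI_Wtime so startup and the reduction are excluded from the measurement.
#include <iostream>
#include "mpi.h"
using namespace std;

double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main()
{
    int n = 3500000;
    double start = -2., end = 0.5;
    double factor = (end - start)/(n*1.);
    double integrate = 0, result = 0;
    int myid, numprocs;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    double t0 = MPI_Wtime();   // wall-clock time, finer resolution than time()
    for (int k = myid; k < n+1; k += numprocs){
        double E = start + k*(end-start)/n;
        double w = (k == 0 || k == n) ? 0.5*factor : factor;  // trapezoid end weights
        integrate += w*int_theta(E);
    }
    double t1 = MPI_Wtime();
    cout << "process " << myid << " compute loop took " << (t1 - t0) << "s" << endl;

    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        cout << result << endl;

    MPI_Finalize();
    return 0;
}

If the per-loop times reported this way still do not halve when going from 2 to 4 processes on your laptop, that points back to the shared hardware resources rather than to a hidden barrier in MPI.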