
Segmentation faults occur when I run a parallel program with Open MPI

In my previous post I needed to distribute the data of PGM files among 10 computers. With help from Jonathan Dursi and Shawn Chin, I have integrated the code. My program compiles, but it gets a segmentation fault when I run it with

mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

The result is

[ubuntu:04803] *** Process received signal ***
[ubuntu:04803] Signal: Segmentation fault (11)
[ubuntu:04803] Signal code: Address not mapped (1)
[ubuntu:04803] Failing at address: 0x7548d0c
[ubuntu:04803] [ 0] [0x86b410]
[ubuntu:04803] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x186b00]
[ubuntu:04803] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04803] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x141bd6]
[ubuntu:04803] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04803] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 4803 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Then I tried running it under valgrind to debug the program, and output.pgm was generated:

valgrind mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

The result is

==4632== Memcheck, a memory error detector
==4632== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==4632== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==4632== Command: mpirun -np 10 ./exmpi_2 2.pgm 10.pgm
==4632==
==4632== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==4632==    at 0x4215D37: syscall (syscall.S:31)
==4632==    by 0x402B335: opal_paffinity_linux_plpa_api_probe_init (plpa_api_probe.c:56)
==4632==    by 0x402B7CC: opal_paffinity_linux_plpa_init (plpa_runtime.c:37)
==4632==    by 0x402B93C: opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:494)
==4632==    by 0x402B180: linux_module_init (paffinity_linux_module.c:119)
==4632==    by 0x40BE2C3: opal_paffinity_base_select (paffinity_base_select.c:64)
==4632==    by 0x40927AC: opal_init (opal_init.c:295)
==4632==    by 0x4046767: orte_init (orte_init.c:76)
==4632==    by 0x804A82E: orterun (orterun.c:540)
==4632==    by 0x804A3EE: main (main.c:13)
==4632==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==4632==
[ubuntu:04638] *** Process received signal ***
[ubuntu:04639] *** Process received signal ***
[ubuntu:04639] Signal: Segmentation fault (11)
[ubuntu:04639] Signal code: Address not mapped (1)
[ubuntu:04639] Failing at address: 0x7548d0c
[ubuntu:04639] [ 0] [0xc50410]  
[ubuntu:04639] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0xde4b00]
[ubuntu:04639] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04639] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xd9fbd6]
[ubuntu:04639] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04639] *** End of error message ***
[ubuntu:04640] *** Process received signal ***
[ubuntu:04640] Signal: Segmentation fault (11)
[ubuntu:04640] Signal code: Address not mapped (1)
[ubuntu:04640] Failing at address: 0x7548d0c
[ubuntu:04640] [ 0] [0xdad410]
[ubuntu:04640] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0xe76b00]
[ubuntu:04640] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04640] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xe31bd6]
[ubuntu:04640] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04640] *** End of error message ***
[ubuntu:04641] *** Process received signal ***
[ubuntu:04641] Signal: Segmentation fault (11)
[ubuntu:04641] Signal code: Address not mapped (1)
[ubuntu:04641] Failing at address: 0x7548d0c
[ubuntu:04641] [ 0] [0xe97410]
[ubuntu:04641] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1e8b00]
[ubuntu:04641] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04641] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1a3bd6]
[ubuntu:04641] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04641] *** End of error message ***
[ubuntu:04642] *** Process received signal ***
[ubuntu:04642] Signal: Segmentation fault (11)
[ubuntu:04642] Signal code: Address not mapped (1)
[ubuntu:04642] Failing at address: 0x7548d0c
[ubuntu:04642] [ 0] [0x92d410]
[ubuntu:04642] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x216b00]
[ubuntu:04642] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04642] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1d1bd6]
[ubuntu:04642] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04642] *** End of error message ***
[ubuntu:04643] *** Process received signal ***
[ubuntu:04643] Signal: Segmentation fault (11)
[ubuntu:04643] Signal code: Address not mapped (1)
[ubuntu:04643] Failing at address: 0x7548d0c
[ubuntu:04643] [ 0] [0x8f4410]
[ubuntu:04643] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x16bb00]
[ubuntu:04643] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04643] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x126bd6]
[ubuntu:04643] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04643] *** End of error message ***
[ubuntu:04638] Signal: Segmentation fault (11)
[ubuntu:04638] Signal code: Address not mapped (1)
[ubuntu:04638] Failing at address: 0x7548d0c
[ubuntu:04638] [ 0] [0x4f6410]
[ubuntu:04638] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x222b00]
[ubuntu:04638] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04638] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1ddbd6]
[ubuntu:04638] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04638] *** End of error message ***
[ubuntu:04644] *** Process received signal ***
[ubuntu:04644] Signal: Segmentation fault (11)
[ubuntu:04644] Signal code: Address not mapped (1)
[ubuntu:04644] Failing at address: 0x7548d0c
[ubuntu:04644] [ 0] [0x61f410]
[ubuntu:04644] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1a3b00]
[ubuntu:04644] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04644] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x15ebd6]
[ubuntu:04644] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04644] *** End of error message ***
[ubuntu:04645] *** Process received signal ***
[ubuntu:04645] Signal: Segmentation fault (11)
[ubuntu:04645] Signal code: Address not mapped (1)
[ubuntu:04645] Failing at address: 0x7548d0c
[ubuntu:04645] [ 0] [0x7a3410]
[ubuntu:04645] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1d5b00]
[ubuntu:04645] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04645] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x190bd6]
[ubuntu:04645] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04645] *** End of error message ***
[ubuntu:04647] *** Process received signal ***
[ubuntu:04647] Signal: Segmentation fault (11)
[ubuntu:04647] Signal code: Address not mapped (1)
[ubuntu:04647] Failing at address: 0x7548d0c
[ubuntu:04647] [ 0] [0xf54410]
[ubuntu:04647] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x2bab00]
[ubuntu:04647] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04647] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x275bd6]
[ubuntu:04647] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04647] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 4639 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
6 total processes killed (some possibly by mpirun during cleanup)
==4632==
==4632== HEAP SUMMARY:
==4632==     in use at exit: 158,751 bytes in 1,635 blocks
==4632==   total heap usage: 10,443 allocs, 8,808 frees, 15,854,537 bytes allocated
==4632==
==4632== LEAK SUMMARY:
==4632==    definitely lost: 81,655 bytes in 112 blocks
==4632==    indirectly lost: 5,108 bytes in 91 blocks
==4632==      possibly lost: 1,043 bytes in 17 blocks
==4632==    still reachable: 70,945 bytes in 1,415 blocks 
==4632==         suppressed: 0 bytes in 0 blocks
==4632== Rerun with --leak-check=full to see details of leaked memory
==4632==
==4632== For counts of detected and suppressed errors, rerun with: -v
==4632== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 96 from 9)

Could someone help me solve this problem? This is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"
#include <syscall.h>

#define SIZE_X 640
#define SIZE_Y 480




 int main(int argc, char **argv)
{
FILE *FR,*FW;
int ierr;
int rank, size;
int ncells;
int greys[SIZE_X][SIZE_Y];
int rows,cols, maxval;

int mystart, myend, myncells;
const int IONODE=0;
int *disps, *counts, *mydata;
int *data;
int i,j,temp1;
char dummy[50]="";





ierr = MPI_Init(&argc, &argv);
if (argc != 3) {
    fprintf(stderr,"Usage: %s infile outfile\n",argv[0]);
    fprintf(stderr,"outputs the negative of the input file.\n");
    return -1;
}            

ierr  = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
if (ierr) {
    fprintf(stderr,"Catastrophic MPI problem; exiting\n");
    MPI_Abort(MPI_COMM_WORLD,1);
}

    if (rank == IONODE) {
            //if (read_pgm(argv[1], &greys, &rows, &cols, &maxval)) {
            //   fprintf(stderr,"Could not read file; exiting\n");
              //   MPI_Abort(MPI_COMM_WORLD,2);

         rows=SIZE_X;
         cols=SIZE_Y;
         maxval=255;
         FR=fopen(argv[1], "r+");

         fgets(dummy,50,FR);
         do{  fgets(dummy,50,FR); } while(dummy[0]=='#');
         fgets(dummy,50,FR);

     for (j = 0; j <cols; j++)
     {
       for (i = 0; i <rows; i++)
       {
           fscanf(FR,"%d",&temp1);
         greys[i][j] = temp1;
       }
     }
}

    ncells = rows*cols;
    disps = (int *)malloc(size * sizeof(int));
    counts= (int *)malloc(size * sizeof(int));
    data = &(greys[0][0]); /* we know all the data is contiguous */

/* everyone calculate their number of cells */
ierr = MPI_Bcast(&ncells, 1, MPI_INT, IONODE, MPI_COMM_WORLD);
myncells = ncells/size;
mystart = rank*myncells;
myend   = mystart + myncells - 1;
if (rank == size-1) myend = ncells-1;
myncells = (myend-mystart)+1;
mydata = (int *)malloc(myncells * sizeof(int));

/* assemble the list of counts.  Might not be equal if don't divide evenly. */
ierr = MPI_Gather(&myncells, 1, MPI_INT, counts, 1, MPI_INT, IONODE, MPI_COMM_WORLD);
if (rank == IONODE) {
    disps[0] = 0;
    for (i=1; i<size; i++) {
        disps[i] = disps[i-1] + counts[i-1];
    }
}

/* scatter the data */
ierr = MPI_Scatterv(data, counts, disps, MPI_INT, mydata, myncells, MPI_INT, IONODE, MPI_COMM_WORLD);

/* everyone has to know maxval */
ierr = MPI_Bcast(&maxval, 1, MPI_INT, IONODE, MPI_COMM_WORLD);

for (i=0; i<myncells; i++)
    mydata[i] = maxval-mydata[i];

/* Gather the data */
ierr = MPI_Gatherv(mydata, myncells, MPI_INT, data, counts, disps, MPI_INT, IONODE, MPI_COMM_WORLD);

if (rank == IONODE)
{
//      write_pgm(argv[2], greys, rows, cols, maxval);
  FW=fopen(argv[2], "w");
  fprintf(FW,"P2\n%d %d\n255\n",rows,cols);    
  for(j=0;j<cols;j++)
    for(i=0;i<rows;i++)
   fprintf(FW,"%d ", greys[i][j]);
}

free(mydata);
if (rank == IONODE) {
    free(counts);
    free(disps);
    //free(&(greys[0][0]));
    //free(greys);

}
fclose(FR);
fclose(FW);
MPI_Finalize();
return 0;
}

This is the input image http://orion.math.iastate.edu/burkardt/data/pgm/balloons.pgm

asked by arep, Mar 09 '11 14:03




1 Answer

Congratulations: the code ran almost perfectly, and only died on nearly the last lines.

The issue would have been a little clearer with valgrind, but running valgrind with MPI (or with anything else that involves a program launcher) takes a bit more care. Instead of:

valgrind mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

which runs valgrind on mpirun itself, something you don't really care about, you want to do

mpirun -np 10 valgrind ./exmpi_2 balloons.pgm output.pgm

That is, you want to launch 10 valgrinds, each running one process's worth of exmpi_2. If you do that (and you've compiled with -g), you'll find valgrind output like the following towards the end:

==6303==  Access not within mapped region at address 0x1
==6303==    at 0x387FA60C17: fclose@@GLIBC_2.2.5 (in /lib64/libc-2.5.so)
==6303==    by 0x401222: main (pgm.c:124)

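And that's all there is to it: every process calls fclose() on both files, but only one process (rank IONODE) ever got a handle from fopen() in the first place. On all the other ranks, FR and FW are uninitialized pointers, so fclose() ends up dereferencing garbage. Simply replacing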

fclose(FR);
fclose(FW);

with

if (rank == IONODE) {
    fclose(FR);
    fclose(FW);
}

seems to work for me.

answered by Jonathan Dursi, Oct 25 '22 01:10