Channel: Clusters and HPC Technology

trivial code fails sometimes under SGE: HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].rev


A trivial ring-passing .f90 program fails to start about half the time on our cluster (SGE 6.2u5). The same problem occurs with large codes.

The error message:

[mpiexec@compute-8-15.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed

[mpiexec@compute-8-15.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:511): error waiting for event

[mpiexec@compute-8-15.local] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion
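
For readers unfamiliar with the failed expression, here is a minimal Python sketch of what the assertion tests. It only mirrors the bit arithmetic quoted in the log; the helper names `assertion_holds` and `unexpected_flags` are made up for illustration, and the log does not say which flag was actually set on the descriptor, so this narrows the candidates rather than identifying the cause.

```python
import select

# Flags the quoted assertion tolerates: POLLIN, POLLOUT and POLLHUP.
ALLOWED = select.POLLIN | select.POLLOUT | select.POLLHUP

# poll(2) event flags exposed by Python's select module on Linux.
FLAG_NAMES = ["POLLIN", "POLLPRI", "POLLOUT", "POLLERR", "POLLHUP", "POLLNVAL"]


def assertion_holds(revents):
    """Same test as the C expression in the log: no bits outside the allowed set."""
    return (revents & ~select.POLLIN & ~select.POLLOUT & ~select.POLLHUP) == 0


def unexpected_flags(revents):
    """Names of the known flags in revents that would make the assertion fail."""
    leftover = revents & ~ALLOWED
    return [name for name in FLAG_NAMES if leftover & getattr(select, name)]


if __name__ == "__main__":
    samples = {
        "readable": select.POLLIN,
        "peer hung up": select.POLLIN | select.POLLHUP,
        "error on descriptor": select.POLLHUP | select.POLLERR,
        "descriptor not open": select.POLLNVAL,
    }
    for label, mask in samples.items():
        print(label, assertion_holds(mask), unexpected_flags(mask))
```

In other words, the assertion fires when poll() reports something other than readable, writable, or hang-up on one of the descriptors mpiexec is watching, for example POLLERR, POLLNVAL, or POLLPRI.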

The F90 code is a 51-line ring program that passes one integer from process 0->1->... (attached below). The test script is trivial:

# /bin/csh
#
#$ -cwd -j y
#$ -o ring0f.np=72.log
#$ -pe orte_ib 72
#$ -q sTNi.q
#
echo + `date +'%Y.%m.%d %T'` started on host `hostname` in queue $QUEUE with jobid=$JOB_ID
echo using $NSLOTS slots on:
cat $PE_HOSTFILE
echo cwd=`pwd`

# the ORTE implementation needs this
setenv  OMPI_MCA_plm_rsh_disable_qrsh 1

# use Intel's mpirun
set MPICH = /software/intel/impi/4.0.3.008

# tell Intel's mpi implementation to use IB
setenv I_MPI_FABRICS "shm:ofa"

# now run ring0f
$MPICH/bin/mpirun -np $NSLOTS ./ring0f
echo = `date +'%Y.%m.%d %T'` done.

The test works about half the time. Output from a successful run:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
+ 2014.09.18 13:02:54 started on host compute-7-15.local in queue sTNi.q with jobid=1928756
using 72 slots on:
compute-7-15.local 3 sTNi.q@compute-7-15.local UNDEFINED
compute-7-23.local 11 sTNi.q@compute-7-23.local UNDEFINED
compute-7-16.local 9 sTNi.q@compute-7-16.local UNDEFINED
compute-9-4.local 18 sTNi.q@compute-9-4.local UNDEFINED
compute-7-7.local 3 sTNi.q@compute-7-7.local UNDEFINED
compute-8-4.local 10 sTNi.q@compute-8-4.local UNDEFINED
compute-8-20.local 2 sTNi.q@compute-8-20.local UNDEFINED
compute-8-14.local 2 sTNi.q@compute-8-14.local UNDEFINED
compute-10-4.local 14 sTNi.q@compute-10-4.local UNDEFINED
cwd=/home/hpc/tests/ib/mytests/ib/ifort
 Process            0  got          456  at pass           1
 Process            1  got          456  at pass           1
 Process            2  got          456  at pass           1
...
 Process           71  got          456  at pass           1
= 2014.09.18 13:02:58 done.

When I repeat the test, I get:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
+ 2014.09.18 16:23:24 started on host compute-8-15.local in queue sTNi.q with jobid=1939073
using 72 slots on:
compute-8-15.local 8 sTNi.q@compute-8-15.local UNDEFINED
compute-10-0.local 48 sTNi.q@compute-10-0.local UNDEFINED
compute-8-2.local 1 sTNi.q@compute-8-2.local UNDEFINED
compute-8-6.local 8 sTNi.q@compute-8-6.local UNDEFINED
compute-7-8.local 4 sTNi.q@compute-7-8.local UNDEFINED
compute-8-17.local 1 sTNi.q@compute-8-17.local UNDEFINED
compute-9-2.local 2 sTNi.q@compute-9-2.local UNDEFINED
cwd=/home/hpc/tests/ib/mytests/ib/ifort
[mpiexec@compute-8-15.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[mpiexec@compute-8-15.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:511): error waiting for event
[mpiexec@compute-8-15.local] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion
= 2014.09.18 16:23:25 done.
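
One way to compare the slot layout of passing and failing runs is to total the hostfile lines that the script prints. The following is a rough sketch, assuming the four-column layout shown in the logs above (host, slots, queue instance, processor range); `parse_pe_hostfile` is a hypothetical helper name, and the sample text is copied from the failing run.

```python
def parse_pe_hostfile(text):
    """Return (host, slots) pairs from $PE_HOSTFILE-style lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        entries.append((fields[0], int(fields[1])))
    return entries


# Hostfile lines copied from the failing run (jobid=1939073).
FAILING_RUN = """\
compute-8-15.local 8 sTNi.q@compute-8-15.local UNDEFINED
compute-10-0.local 48 sTNi.q@compute-10-0.local UNDEFINED
compute-8-2.local 1 sTNi.q@compute-8-2.local UNDEFINED
compute-8-6.local 8 sTNi.q@compute-8-6.local UNDEFINED
compute-7-8.local 4 sTNi.q@compute-7-8.local UNDEFINED
compute-8-17.local 1 sTNi.q@compute-8-17.local UNDEFINED
compute-9-2.local 2 sTNi.q@compute-9-2.local UNDEFINED
"""

if __name__ == "__main__":
    entries = parse_pe_hostfile(FAILING_RUN)
    total = sum(slots for _, slots in entries)
    print(len(entries), "hosts,", total, "slots")
    # Largest allocations first, to spot lopsided layouts quickly.
    for host, slots in sorted(entries, key=lambda e: -e[1]):
        print(host, slots)
```

Running the same helper over the logs of several passing and failing jobs would show whether failures correlate with particular hosts or with how the 72 slots are distributed; two runs alone are not enough to conclude anything.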

Any idea what causes this, or how to track down the problem?

Sylvain Korzennik - HPC analyst, SI/HPC

(Smithsonian Institution High performance Computer)

PS: % uname -a
Linux hydra-2.si.edu 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
     % ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.0.2.137 Build 20110112
Copyright (C) 1985-2011 Intel Corporation.  All rights reserved.

program ring0f
  !  Ring.c-> MPI example from http://www-unix.mcs.anl.gov/mpi
  !  modified by SGK Sep 2014 to f90
  !  Write a program that takes data from process zero (0 to quit) and sends it
  !  to all of the other processes by sending it in a ring. That is, process i
  !  should receive the data and send it to process i+1, until the last process
  !  is reached.  Assume that the data consists of a single integer. Process zero
  !  reads the data from the user.
  !
  include 'mpif.h'
  integer nPass, iVal
  integer iErr, iStatus, iRank, iSize, iDest, iFrom, nCount
  dimension iVal(10)
  dimension iStatus(MPI_STATUS_SIZE)
  integer mpiComm, msgTag
  data nPass/1/, iVal/1234,9*0/
  !
  mpiComm = MPI_COMM_WORLD
  msgTag  = 0
  !
  call MPI_INIT(iErr)
  call MPI_COMM_RANK(mpiComm, iRank, iErr)
  call MPI_COMM_SIZE(mpiComm, iSize, iErr)
  !
  !print *, 'i>rank ', iRank, ' size ', iSize
  !
  iDest = iRank+1
  iFrom = iRank-1
  nCount = 1
  do iPass = 1, nPass
     if (iRank.eq.0) then
        iVal(1) = 456
        !print *, iRank, '->', iDest
        call MPI_SEND(iVal, nCount, MPI_INTEGER, iDest, msgTag, mpiComm, iErr)
     else
        !print *, iRank, '<-', iFrom
        call MPI_RECV(iVal, nCount, MPI_INTEGER, iFrom, msgTag, mpiComm, iStatus, iErr)
        if (iDest.lt.iSize) then
           !print *, iRank, '->', iDest
           call MPI_SEND(iVal, nCount, MPI_INTEGER, iDest, msgTag, mpiComm, iErr)
        end if
     end if
     !
     print *, 'Process ', iRank, ' got ', iVal(1), ' at pass', iPass
     !
  end do
  !
  !print *, 'e>rank ', iRank, ' size ', iSize
  call MPI_FINALIZE(iErr)
  !
end program ring0f

Makefile:

#
# Makefile for demos
#
# <- Last updated: Mon May 24 13:09:30 2010 -> SGK
#
# ---------------------------------------------------------------------------
#
# mpi location
MPDIR = /software/intel/impi/4.0.3.008
MPINC = $(MPDIR)/include64
MPLIB = $(MPDIR)/lib64
MPBIN = $(MPDIR)/bin64
#
# flags
CFLAGS = -O
FFLAGS = -O
MFLAGS = -I$(MPINC)
#
# compiler/linker
CC     = icc  $(CFLAGS) $(MFLAGS) $(IFLAGS)
MPICC  = $(MPBIN)/mpicc -cc=icc $(CFLAGS)
F90    = ifort  $(CFLAGS) $(MFLAGS) $(IFLAGS)
MPIF90 = $(MPBIN)/mpif90 -f90=ifort $(FFLAGS)
#
%.o: %.f90
	$(F90) -c $<
#
# ---------------------------------------------------------------------------
#
all: ring0 ring0f
#
ring0: ring0.o
	$(MPICC) -o $@ ring0.o
#
ring0f: ring0f.o
	$(MPIF90) -o $@ ring0f.o
#
# ---------------------------------------------------------------------------
#
clean:
	-rm *.o ring0 ring0f
cleaner:
	-rm *.log

 

