Channel: Clusters and HPC Technology

problem with intel mpi 2019


When I compile a test program with the latest beta of Intel MPI 2019, I receive the error below. Has anybody seen the same problem?

$ mpiicc -o test test.c
ld: warning: libfabric.so.1, needed by /common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/lib/release/libmpi.so, not found (try using -rpath or -rpath-link)
/common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/lib/release/libmpi.so: undefined reference to `fi_strerror@FABRIC_1.0'
/common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/lib/release/libmpi.so: undefined reference to `fi_tostr@FABRIC_1.0'
/common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/lib/release/libmpi.so: undefined reference to `fi_fabric@FABRIC_1.1'
/common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/lib/release/libmpi.so: undefined reference to `fi_dupinfo@FABRIC_1.1'
/common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/lib/release/libmpi.so: undefined reference to `fi_getinfo@FABRIC_1.1'
/common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/lib/release/libmpi.so: undefined reference to `fi_freeinfo@FABRIC_1.1'

$ type mpiicc
mpiicc is /common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/bin/mpiicc
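
A workaround I am experimenting with is to point the linker and loader at the libfabric that ships with the 2019 beta (a sketch; the intel64/libfabric/lib location under the install tree is my assumption):

# Assumed location of the libfabric bundled with the 2019 beta (adjust to the real install layout):
FABRIC_LIB=/common/intel/compilers_and_libraries_2018.1.163/linux/mpi_2019/intel64/libfabric/lib
export LD_LIBRARY_PATH=$FABRIC_LIB:$LD_LIBRARY_PATH       # ld also consults this when resolving libmpi.so's dependencies
mpiicc -o test test.c -Wl,-rpath-link,$FABRIC_LIB         # or pass the directory explicitly, as the warning suggests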

$ cat test.c
/*
    Copyright 2003-2017 Intel Corporation.  All Rights Reserved.

    The source code contained or described herein and all documents
    related to the source code ("Material") are owned by Intel Corporation
    or its suppliers or licensors.  Title to the Material remains with
    Intel Corporation or its suppliers and licensors.  The Material is
    protected by worldwide copyright and trade secret laws and treaty
    provisions.  No part of the Material may be used, copied, reproduced,
    modified, published, uploaded, posted, transmitted, distributed, or
    disclosed in any way without Intel's prior express written permission.

    No license under any patent, copyright, trade secret or other
    intellectual property right is granted to or conferred upon you by
    disclosure or delivery of the Materials, either expressly, by
    implication, inducement, estoppel or otherwise.  Any license under
    such intellectual property rights must be express and approved by
    Intel in writing.
*/
#include "mpi.h"
#include <stdio.h>
#include <string.h>

int
main (int argc, char *argv[])
{
    int i, rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status stat;

    MPI_Init (&argc, &argv);

    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name (name, &namelen);

    if (rank == 0) {

        printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);

        for (i = 1; i < size; i++) {
            MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
            MPI_Recv (&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
            MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
            MPI_Recv (name, namelen + 1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat);
            printf ("Hello world: rank %d of %d running on %s\n", rank, size, name);
        }

    } else {

        MPI_Send (&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send (&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send (&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send (name, namelen + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);

    }

    MPI_Finalize ();

    return (0);
}

 

 


How to get the exit code from mpiexec.hydra


When running a workload on multiple nodes with mpiexec.hydra, the entire run aborts when even one node fails or shuts down. I want to detect whether the failure is due to a node disconnection or something else. Trying to print the exit codes with "-print-all-exitcodes" doesn't seem to work.
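
What I have working so far is only the aggregate exit status of the launcher itself, checked from a wrapper script (a minimal bash sketch; the hostfile and application names are placeholders):

mpiexec.hydra -print-all-exitcodes -f hosts.txt -n 64 ./my_app
rc=$?
# A non-zero status covers node loss as well as application failure,
# so on its own it cannot tell the two apart.
echo "mpiexec.hydra exited with code $rc"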

Is there any other option?

Python MPI4PY ISSUE


Hi All,

I'm an HPC admin. I installed the mpi4py library (version 3.0) on our clusters from the .tar source and with pip2.7 (Python 2.7). Since then, a 256-core job (n=4, ppn=64) no longer runs across nodes, although plain Python code still runs. Users are also unable to run jobs on the cluster (VASP, mpi4py, MPI/Open MPI applications, etc.).

The error is given below:

[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[mpiexec@tyrone-node16] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:184): assert (!closed) failed
[mpiexec@tyrone-node16] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:74): unable to send SIGUSR1 downstream
[mpiexec@tyrone-node16] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@tyrone-node16] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@tyrone-node16] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completions
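
One check I can run from a login node is to confirm which MPI library mpi4py was actually built against before retrying a small job (a sketch; MPI.get_vendor() comes from mpi4py, the host names are placeholders):

python2.7 -c "from mpi4py import MPI; print(MPI.get_vendor())"    # which MPI mpi4py was compiled against
mpirun -n 2 -hosts node01,node02 python2.7 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"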

 

 

 

Kindly help me.

Thanks in advance!

Rahul Akolkar

OPA driver for Skylake running Ubuntu 16


Hi

Can you please point me to the BKMs for getting the Intel Omni-Path Fabric (OPA) driver installed and set up on Ubuntu 16? I have a Skylake processor (Gold 6148F CPU @ 2.40GHz).

Thanks,

Dave 

ITAC error during trace collection


Hi all,

I am trying to collect tracing info for my Intel MPI job. Even for a relatively small number of processes (around 300), the run either hangs or I receive the following error message:

UCM connect: REQ RETRIES EXHAUSTED: 0x570 32c43 0xed -> 0x544 3f3a4 0xbbdd

How can I debug this error? 
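
For reference, this is roughly how I launch the trace collection, with the debug knobs I know of turned up (a sketch; the executable name and rank count are placeholders):

export I_MPI_DEBUG=5      # report fabric/provider and pinning decisions per rank
export VT_VERBOSE=3       # make the trace collector describe what it is doing
mpirun -trace -n 300 ./my_app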

Best,

Igor

Using Intel Trace analyzer with windows and VS2015


Hi,

I've been directed here from the Intel® Visual Fortran Compiler for Windows forum.

I'm trying to use Intel Trace analyzer with windows and VS2015 on my CFD code.

In my code, I have many F90 files but different modules and subroutines. Also, I'm linking with other libraries such as PETSc.

I tried adding "/trace" to the additional command-line options in the VS2015 GUI and compiling. However, after running my code, no *.stf file is generated.

I then tried the same thing in Cygwin, on another, smaller code, by adding -trace when compiling and linking. Similarly, no *.stf is generated.

I also tried to compile and link directly at the command prompt:

mpiifort /c /MT /Z7 /fpp /Ic:\wtay\Lib\petsc-3.8.3_win64_impi_vs2015_debug\include /trace   /o ex2f.obj ex2f.F

mpiifort /MT /trace /o ex2f.exe ex2f.obj /INCREMENTAL:NO /NOLOGO /qnoipo /LIBPATH:"C:\wtay\Lib\petsc-3.8.3_win64_impi_vs2015_debug\lib" Gdi32.lib User32.lib Advapi32.lib Kernel32.lib Ws2_32.lib impi.lib impid.lib impicxx.lib impicxxd.lib libpetsc.lib libflapack.lib libfblas.lib kernel32.lib

Now I can get the *.stf file.

But is there some way to do it in VS2015? As mentioned, I have many F90 files and I hope I do not have to use the command line.

Thanks.

Pinning processes to specific cores?


I'm wondering if Intel MPI has the facility to allow me to pin processes, not only to specific nodes within my cluster, but to specific cores within those nodes. With Open MPI, I can set up a rankfile that will give me this fine-grained capability, that is, I can assign each MPI process rank to a specific node and a given core on that node (logical or physical).
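
For illustration, the kind of Open MPI rankfile I have in mind looks roughly like this (a sketch; host names, slots and the application are made up):

# Open MPI rankfile: pin rank 0 to core 0 of node01 and rank 1 to core 4 of node02.
cat > rankfile <<'EOF'
rank 0=node01 slot=0
rank 1=node02 slot=4
EOF
mpirun -np 2 --rankfile rankfile ./my_app     # Open MPI invocation, shown for comparison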

Granted, the rankfile idea from Open MPI is merely theoretical since the OS I have on the machines doesn't seem to abide by the assignments I make, but at least the possibility is there. I haven't found that level of control within the Intel MPI documentation. Any pointers or is this all just a pipe dream on my part?

Cross Platform MPI start failed


Hi Intel Engineers,

I met some problems when setting up a cross-platform MPI environment.

Following the Intel MPI 2018 developer guides for Linux and Windows, two machines were set up: one runs CentOS and the other runs Windows Server 2000.

The CentOS machine is 'mpihost1' and the Windows machine is 'iriphost1'.

SSH is configured correctly; 'ssh root@mpihost1' connects from Windows to Linux successfully.

However, when using the command 'mpiexec -d -bootstrap ssh -hostos linux -host mpihost1 -n 1 hostname', the error 'bash: pmi_proxy: command not found' occurred.

Are there any suggestions?

Thanks

zhongqi

 

 

Here is the debug info:

C:\Windows\system32>mpiexec -d -bootstrap ssh -hostos linux -host mpihost1 -n 1 hostname
host: mpihost1

==================================================================================================
mpiexec options:
----------------
  Base path: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mpi\intel64\b
in\
  Launcher: ssh
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    ALLUSERSPROFILE=C:\ProgramData
    APPDATA=C:\Users\root\AppData\Roaming
    CLIENTNAME=D1301002443
    CommonProgramFiles=C:\Program Files\Common Files
    CommonProgramFiles(x86)=C:\Program Files (x86)\Common Files
    CommonProgramW6432=C:\Program Files\Common Files
    COMPUTERNAME=IRIPHOST1
    ComSpec=C:\Windows\system32\cmd.exe
    CYGWIN=tty
    FP_NO_HOST_CHECK=NO
    HOME=F:\zzq\home\
    HOMEDRIVE=C:
    HOMEPATH=\Users\root
    INTEL_LICENSE_FILE=C:\Program Files (x86)\Common Files\Intel\Licenses
    I_MPI_ROOT=C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mpi
    LOCALAPPDATA=C:\Users\root\AppData\Local
    LOGONSERVER=\\IRIPHOST1
    NUMBER_OF_PROCESSORS=4
    OS=Windows_NT
    Path=C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mpi\intel64\bin;C
:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;E:\lxx
\scs;C:\Program Files\MySQL\MySQL Server 5.7\bin;C:\Program Files (x86)\Git\cmd;C:\Program Files (x86)\Gi
tExtensions\;F:\zzq\mpi\MinGW\msys\1.0\bin
    PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
    PROCESSOR_ARCHITECTURE=AMD64
    PROCESSOR_IDENTIFIER=Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
    PROCESSOR_LEVEL=6
    PROCESSOR_REVISION=2a07
    ProgramData=C:\ProgramData
    ProgramFiles=C:\Program Files
    ProgramFiles(x86)=C:\Program Files (x86)
    ProgramW6432=C:\Program Files
    PROMPT=$P$G
    PSModulePath=C:\Windows\system32\WindowsPowerShell\v1.0\Modules\
    PUBLIC=C:\Users\Public
    SESSIONNAME=RDP-Tcp#0
    SystemDrive=C:
    SystemRoot=C:\Windows
    TEMP=C:\Users\root\AppData\Local\Temp\2
    TMP=C:\Users\root\AppData\Local\Temp\2
    USERDOMAIN=IRIPHOST1
    USERNAME=root
    USERPROFILE=C:\Users\root
    windir=C:\Windows

  Hydra internal environment:
  ---------------------------
    MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1
    GFORTRAN_UNBUFFERED_PRECONNECTED=y
    I_MPI_HYDRA_UUID=af00a0000-d366f2d3-34f29834-8a985d8a-

  Intel(R) MPI Library specific variables:
  ----------------------------------------
    I_MPI_ROOT=C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mpi
    I_MPI_HYDRA_UUID=af00a0000-d366f2d3-34f29834-8a985d8a-

    Proxy information:
    *********************
      [1] proxy: mpihost1 (1 cores)
      Exec list: hostname (1 processes);

==================================================================================================

[mpiexec@iriphost1] Timeout set to -1 (-1 means infinite)
[mpiexec@iriphost1] Got a control port string of iriphost1:60519

Proxy launch args: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mpi\int
el64\bin\pmi_proxy --control-port iriphost1:60519 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --r
mk user --launcher ssh --demux select --pgid 0 --enable-stdin 1 --retries 10 --control-code 9182 --usize
-2 --proxy-id

Arguments being passed to proxy 0:
--version 3.2 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname mpihost1 --global-core-map
0,1,1 --pmi-id-map 0,0 --global-process-count 1 --auto-cleanup 1 --pmi-kvsname kvs_2800_0 --pmi-process-m
apping (vector,(0,1,1)) --topolib ipl --ckpointlib blcr --ckpoint-prefix /tmp --ckpoint-preserve 1 --ckpo
int off --ckpoint-num -1 --global-inherited-env 41 'ALLUSERSPROFILE=C:\ProgramData''APPDATA=C:\Users\roo
t\AppData\Roaming''CLIENTNAME=D1301002443''CommonProgramFiles=C:\Program Files\Common Files''CommonPro
gramFiles(x86)=C:\Program Files (x86)\Common Files''CommonProgramW6432=C:\Program Files\Common Files''C
OMPUTERNAME=IRIPHOST1''ComSpec=C:\Windows\system32\cmd.exe''CYGWIN=tty''FP_NO_HOST_CHECK=NO''HOME=F:\
zzq\home\''HOMEDRIVE=C:''HOMEPATH=\Users\root''INTEL_LICENSE_FILE=C:\Program Files (x86)\Common Files\
Intel\Licenses''I_MPI_ROOT=C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\window
s\mpi''LOCALAPPDATA=C:\Users\root\AppData\Local''LOGONSERVER=\\IRIPHOST1''NUMBER_OF_PROCESSORS=4''OS=
Windows_NT''Path=C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mpi\inte
l64\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.
0\;E:\lxx\scs;C:\Program Files\MySQL\MySQL Server 5.7\bin;C:\Program Files (x86)\Git\cmd;C:\Program Files
 (x86)\GitExtensions\;F:\zzq\mpi\MinGW\msys\1.0\bin''PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF
;.WSH;.MSC''PROCESSOR_ARCHITECTURE=AMD64''PROCESSOR_IDENTIFIER=Intel64 Family 6 Model 42 Stepping 7, Ge
nuineIntel''PROCESSOR_LEVEL=6''PROCESSOR_REVISION=2a07''ProgramData=C:\ProgramData''ProgramFiles=C:\P
rogram Files''ProgramFiles(x86)=C:\Program Files (x86)''ProgramW6432=C:\Program Files''PROMPT=$P$G''P
SModulePath=C:\Windows\system32\WindowsPowerShell\v1.0\Modules\''PUBLIC=C:\Users\Public''SESSIONNAME=RD
P-Tcp#0''SystemDrive=C:''SystemRoot=C:\Windows''TEMP=C:\Users\root\AppData\Local\Temp\2''TMP=C:\Users
\root\AppData\Local\Temp\2''USERDOMAIN=IRIPHOST1''USERNAME=root''USERPROFILE=C:\Users\root''windir=C:
\Windows' --global-user-env 0 --global-system-env 3 'MPIR_CVAR_NEMESIS_ENABLE_CKPOINT=1''GFORTRAN_UNBUFF
ERED_PRECONNECTED=y''I_MPI_HYDRA_UUID=af00a0000-d366f2d3-34f29834-8a985d8a-' --proxy-core-count 1 --mpi-
cmd-env C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.1.156\windows\mpi\intel64\bin\mp
iexec.exe -d -bootstrap ssh -hostos linux -host mpihost1 -n 1 hostname  --exec --exec-appnum 0 --exec-pro
c-count 1 --exec-local-env 0 --exec-wdir C:\Windows\system32 --exec-args 1 hostname

[mpiexec@iriphost1] Launch arguments: F:\zzq\mpi\MinGW\msys\1.0\bin/ssh.exe -x -q mpihost1 pmi_proxy --co
ntrol-port iriphost1:60519 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk user --launcher ssh
--demux select --pgid 0 --enable-stdin 1 --retries 10 --control-code 9182 --usize -2 --proxy-id 0
[mpiexec@iriphost1] STDIN will be redirected to 1 fd(s): 4
bash: pmi_proxy: command not found
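
A quick check I can run from the Windows side is whether pmi_proxy is visible on the remote node's non-interactive PATH (a sketch; the mpivars.sh path on the Linux box is a guess and needs adjusting to the actual install):

ssh root@mpihost1 'which pmi_proxy || echo "pmi_proxy not in non-interactive PATH"'
# If it is missing, the Intel MPI environment probably has to be sourced for non-interactive shells, e.g.:
ssh root@mpihost1 'echo "source /opt/intel/compilers_and_libraries_2018/linux/mpi/intel64/bin/mpivars.sh" >> ~/.bashrc'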


Unable to generate trace file (*.stf)


Hello all,

I am trying to generate a trace file for performance profiling of my code. I am testing on Stampede2 with the following modules loaded:

Currently Loaded Modules:
  1) git/2.9.0       3) xalt/1.7.7   5) intel/17.0.4   7) python/2.7.13   9) hdf5/1.8.16  11) petsc/3.7
  2) autotools/1.1   4) TACC         6) impi/17.0.3    8) gsl/2.3        10) papi/5.5.1   12) itac/17.0.3

The ITAC help shows how to generate a trace file for a given simulation. Below is the build configuration used by the code:

# Whenever this version string changes, the application is configured
# and rebuilt from scratch
VERSION = stampede2-2017-10-03

CPP = cpp
FPP = cpp
CC  = mpicc
CXX = mpicxx
F77 = ifort
F90 = ifort

CPPFLAGS = -g -trace -D_XOPEN_SOURCE -D_XOPEN_SOURCE_EXTENDED
FPPFLAGS = -g -trace -traditional
CFLAGS   = -g -trace -traceback -debug all -xCORE-AVX2 -axMIC-AVX512 -align -std=gnu99
CXXFLAGS = -g -trace -traceback -debug all -xCORE-AVX2 -axMIC-AVX512 -align -std=gnu++11
F77FLAGS = -g -trace -traceback -debug all -xCORE-AVX2 -axMIC-AVX512 -align -pad -safe-cray-ptr
F90FLAGS = -g -trace -traceback -debug all -xCORE-AVX2 -axMIC-AVX512 -align -pad -safe-cray-ptr

LDFLAGS = -rdynamic -xCORE-AVX2 -axMIC-AVX512

C_LINE_DIRECTIVES = yes
F_LINE_DIRECTIVES = yes

VECTORISE                = yes
VECTORISE_ALIGNED_ARRAYS = no
VECTORISE_INLINE         = no

DEBUG = no
CPP_DEBUG_FLAGS = -DCARPET_DEBUG
FPP_DEBUG_FLAGS = -DCARPET_DEBUG
C_DEBUG_FLAGS   = -O0
CXX_DEBUG_FLAGS = -O0
F77_DEBUG_FLAGS = -O0 -check bounds -check format
F90_DEBUG_FLAGS = -O0 -check bounds -check format

OPTIMISE = yes
CPP_OPTIMISE_FLAGS = # -DCARPET_OPTIMISE -DNDEBUG
FPP_OPTIMISE_FLAGS = # -DCARPET_OPTIMISE -DNDEBUG
C_OPTIMISE_FLAGS   = -Ofast
CXX_OPTIMISE_FLAGS = -Ofast
F77_OPTIMISE_FLAGS = -Ofast
F90_OPTIMISE_FLAGS = -Ofast

CPP_NO_OPTIMISE_FLAGS  =
FPP_NO_OPTIMISE_FLAGS  =
C_NO_OPTIMISE_FLAGS    = -O0
CXX_NO_OPTIMISE_FLAGS  = -O0
CUCC_NO_OPTIMISE_FLAGS =
F77_NO_OPTIMISE_FLAGS  = -O0
F90_NO_OPTIMISE_FLAGS  = -O0

PROFILE = no
CPP_PROFILE_FLAGS =
FPP_PROFILE_FLAGS =
C_PROFILE_FLAGS   = -pg
CXX_PROFILE_FLAGS = -pg
F77_PROFILE_FLAGS = -pg
F90_PROFILE_FLAGS = -pg

OPENMP           = yes
CPP_OPENMP_FLAGS = -fopenmp
FPP_OPENMP_FLAGS = -fopenmp
C_OPENMP_FLAGS   = -fopenmp
CXX_OPENMP_FLAGS = -fopenmp
F77_OPENMP_FLAGS = -fopenmp
F90_OPENMP_FLAGS = -fopenmp

WARN           = yes
CPP_WARN_FLAGS =
FPP_WARN_FLAGS =
C_WARN_FLAGS   =
CXX_WARN_FLAGS =
F77_WARN_FLAGS =
F90_WARN_FLAGS =

BLAS_DIR  = NO_BUILD
BLAS_LIBS = -mkl

HWLOC_DIR        = NO_BUILD
HWLOC_EXTRA_LIBS = numa

LAPACK_DIR  = NO_BUILD
LAPACK_LIBS = -mkl

OPENBLAS_DIR  = NO_BUILD
OPENBLAS_LIBS = -mkl

HDF5_DIR = /opt/apps/intel17/hdf5/1.8.16/x86_64

BOOST_DIR = /opt/apps/intel17/boost/1.64

GSL_DIR = /opt/apps/intel17/gsl/2.3

FFTW3_DIR = NO_BUILD
FFTW3_INC_DIRS = /opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/include/fftw
FFTW3_LIBS = -mkl

PAPI_DIR = /opt/apps/papi/5.5.1

PETSC_DIR = /home1/apps/intel17/impi17_0/petsc/3.7/knightslanding
PETSC_LAPACK_EXTRA_LIBS = -mkl

PTHREADS_DIR = NO_BUILD

I am using mpicc/mpicxx with the -trace flag so that ITAC can be used. I then submit the job with the scripts below. First, the run script is generated:

#! /bin/bash

echo "Preparing:"
set -x                          # Output commands
set -e                          # Abort on errors

cd @RUNDIR@-active

module unload mvapich2
module load impi/17.0.3
module list

echo "Checking:"
pwd
hostname
date

echo "Environment:"
#export I_MPI_FABRICS=shm:ofa
#export I_MPI_MIC=1
#export I_MPI_OFA_ADAPTER_NAME=mlx4_0
export CACTUS_NUM_PROCS=@NUM_PROCS@
export CACTUS_NUM_THREADS=@NUM_THREADS@
export CACTUS_SET_THREAD_BINDINGS=1
export CXX_MAX_TASKS=500
export GMON_OUT_PREFIX=gmon.out
export OMP_MAX_TASKS=500
export OMP_NUM_THREADS=@NUM_THREADS@
export OMP_STACKSIZE=8192       # kByte
export PTHREAD_MAX_TASKS=500
env | sort > SIMFACTORY/ENVIRONMENT
echo ${SLURM_NODELIST} > NODES

echo "Starting:"
export CACTUS_STARTTIME=$(date +%s)
export VT_PCTRACE=1
time ibrun -trace @EXECUTABLE@ -L 3 @PARFILE@

echo "Stopping:"
date

echo "Done."

As you can see above, I use ibrun -trace. Below is the submit script:

#! /bin/bash
#SBATCH -A @ALLOCATION@
#SBATCH -p @QUEUE@
#SBATCH -t @WALLTIME@
#SBATCH -N @NODES@ -n @NUM_PROCS@
#SBATCH @("@CHAINED_JOB_ID@" != "" ? "-d afterany:@CHAINED_JOB_ID@" : "")@
#SBATCH -J @SHORT_SIMULATION_NAME@
#SBATCH --mail-type=ALL
#SBATCH --mail-user=@EMAIL@
#SBATCH -o @RUNDIR@/@SIMULATION_NAME@.out
#SBATCH -e @RUNDIR@/@SIMULATION_NAME@.err
cd @SOURCEDIR@
@SIMFACTORY@ run @SIMULATION_NAME@ --machine=@MACHINE@ --restart-id=@RESTART_ID@ @FROM_RESTART_COMMAND@

I believe that is all I need to generate a trace file. I submitted several simple jobs to check, but no trace file appeared after the simulations. The simulations completed without problems, so I am stuck.

Does anyone have an idea about this? The code I would like to profile is the Einstein Toolkit.
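
As a sanity check, I may first try tracing a trivial program outside the full build (a sketch; it assumes the impi and itac modules are loaded and hello.c is a placeholder MPI hello-world):

mpicc -trace hello.c -o hello     # link the ITAC trace collector into a minimal program
ibrun ./hello                     # run it inside a small batch job
ls *.stf                          # an .stf file should appear in the working directory if tracing works at all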

What's the expected slowdown for -gdb on MPI app?


I'm encountering a repeatable memory error that goes away as I increase the number of processes. I suspect there is some static allocation or other memory limit being hit, and having more processes spreads the needed memory so that each process eventually fits within that limit. So I wanted to use GDB to track down where the memory error is cropping up in order to fix the code. (Overall memory use is only in the single-digit percentages of what's available when the code crashes.)

Without the '-gdb' option, I can run an instance of the code in just over 1 second. If I add the debugger flag, then after I type "run" at the (mpigdb) prompt, I wait and wait and wait. Looking at 'top' in another window, I see the mpiexec.hydra process pop up with 0.3% CPU every once in a while. For example,

[clay@XXX src]$ time mpiexec -n 2 graph500_reference_bfs 15

real    0m1.313s
user    0m2.255s
sys     0m0.345s
[clay@XXX src]$ mpiexec -gdb -n 2 graph500_reference_bfs 15
mpigdb: np = 2
mpigdb: attaching to 1988 graph500_reference_bfs qc-2.oda-internal.com
mpigdb: attaching to 1989 graph500_reference_bfs qc-2.oda-internal.com
[0,1] (mpigdb) run
[0,1]   Continuing.
^Cmpigdb: ending..
[mpiexec@XXX] Sending Ctrl-C to processes as requested
[mpiexec@XXX] Press Ctrl-C again to force abort
[clay@XXX src]$

Do I need to just be more patient? If the real problem test case takes almost 500 seconds to reach the error point, how patient do I need to be? Or is there something else I need to do differently to get things to execute in a timely manner? (I've tried to attach to one of the running processes, but that didn't work at all.)

I was hoping not to resort to the most common debugger, the 'printf' statement, if I could help it. And using a debugger would elevate my skills in the eyes of management. :-)

Thanks.

--clay 

Benchmark With Broadwell


Hi Team,

Need help to achieve the optimal result.

E5-2697 v4 @ 2.30 GHz base, 2.00 GHz AVX frequency; 36 cores per node (2 x 18), 16 FLOP/cycle with AVX2 FMA:

2.3 GHz * 36 cores * 16 FLOP/cycle = 1324.8 GFLOPS theoretical peak per node (base/TDP frequency)
2.0 GHz * 36 cores * 16 FLOP/cycle = 1152 GFLOPS theoretical peak per node (AVX frequency)

Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
Linux master.local 3.10.0-693.5.2.el7.x86_64
CentOS Linux release 7.4.1708 (Core)

Two Node Result 

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      231168   192     8     9            7936.44            1.03770e+03

mpirun  -print-rank-map -np  72  -genv I_MPI_DEBUG 5 -genv I_MPI_FALLBACK_DEVICE 0 -genv I_MPI_FABRICS shm:dapl --machinefile $PBS_NODEFILE  /opt/apps/intel/mkl/benchmarks/mp_linpack/xhpl_intel64_static

Single node Performance
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      163200   192     6     6            4123.17            7.02820e+02

 

Need your support.

Thank You

Trace Collector with ILP64 MKL and MPI libraries


Hi,

Is it possible to use the Intel Trace Collector (on linux) with the ILP64 MKL and MPI libraries?  I see on the MPI page

          https://software.intel.com/en-us/mpi-developer-reference-linux-ilp64-sup...

the statement

"If you want to use the Intel® Trace Collector with the Intel MPI Library ILP64 executable files, you must use a special Intel Trace Collector library. If necessary, the mpiifort compiler wrapper will select the correct Intel Trace Collector library automatically."

I don't really understand whether this means (1) there are special Trace Collector libraries available, or (2) you somehow have to generate your own special library. I can find no information in the Trace Collector documentation itself concerning ILP64 support.

Thanks,

John

Intel MPI segmentation fault bug


Hi,

I have come across a bug in Intel MPI when testing in a Docker container with no NUMA support. It appears that the case of no NUMA support is not being handled correctly. More details below.

Thanks,

Jamil

    icc --version
    icc (ICC) 17.0.6 20171215

gcc --version
gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)

     uname -a
     Linux centos7dev 4.9.60-linuxkit-aufs #1 SMP Mon Nov 6 16:00:12 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 

     bug.c

#include "mpi.h"

int main (int argc, char *argv[])
{
    /* The segmentation fault occurs inside MPI_Init itself; see the backtrace below. */
    MPI_Init (&argc, &argv);
    MPI_Finalize ();
    return 0;
}

I_MPI_CC=gcc mpicc -g bug.c -o bug

gdb ./bug

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b64f45 in __I_MPI___intel_sse2_strtok () from /opt/intel/compilers_and_libraries_2017.6.256/linux/mpi/intel64/lib/libmpifort.so.12
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-16.el7_4.2.x86_64 numactl-devel-2.0.9-6.el7_2.x86_64
(gdb) bt
#0  0x00007ffff7b64f45 in __I_MPI___intel_sse2_strtok () from /opt/intel/compilers_and_libraries_2017.6.256/linux/mpi/intel64/lib/libmpifort.so.12
#1  0x00007ffff70acab1 in MPID_nem_impi_create_numa_nodes_map () at ../../src/mpid/ch3/src/mpid_init.c:1355
#2  0x00007ffff70ad994 in MPID_Init (argc=0x1, argv=0x7ffff72a2268, requested=-148233624, provided=0x1, has_args=0x0, has_env=0x0)
    at ../../src/mpid/ch3/src/mpid_init.c:1733
#3  0x00007ffff7043ebb in MPIR_Init_thread (argc=0x1, argv=0x7ffff72a2268, required=-148233624, provided=0x1) at ../../src/mpi/init/initthread.c:717
#4  0x00007ffff70315bb in PMPI_Init (argc=0x1, argv=0x7ffff72a2268) at ../../src/mpi/init/init.c:253
#5  0x00000000004007e8 in main (argc=1, argv=0x7fffffffcd58) at bug.c:6

MPI stat


Hi

I want to generate a timing log of MPI functions. I am using "export I_MPI_STATS=20" to enable logging, but this captures timing info from only one node. How can I get the same information from all of the nodes used in the run?

Thanks

Biren

 

How to use Intel MPI to create system resources such as Opengl windows, system shared memory on Windows 10?


Our school project needs MPI and OpenGL, but in our attempts we failed to create an OpenGL window and system shared memory from within an Intel MPI process. Could anyone help us?

Our os is Windows 10.


Issue with MPI_ALLREDUCE with MPI_REAL16


Hello!

I am running a quad-precision code in MPI. However, when I perform MPI_ALLREDUCE with MPI_REAL16 as the datatype, the code gives a segmentation fault. How do I perform quad-precision reduction operations in MPI? Any advice would be greatly appreciated.

Regards

Suman Vajjala

PETSc 3.8 build: internal error: 0_76


I'm attempting a PETSc 3.8 build with Intel Parallel Studio 2017.0.5. The build fails without much information, but it appears to be an internal compiler error.

Some key output:

...
/home/cchang/Packages/petsc-3.8/src/vec/is/sf/impls/basic/sfbasic.c(528): (col. 1) remark: FetchAndInsert__blocktype_int_4_1 has been targeted for automatic cpu dispatch
": internal error: 0_76
compilation aborted for /home/cchang/Packages/petsc-3.8/src/vec/is/sf/impls/basic/sfbasic.c (code 4)
gmake[2]: *** [impi-intel/obj/src/vec/is/sf/impls/basic/sfbasic.o] Error 4

Could you tell me what this error 0_76 is? I can provide log files or environment info if these will help.

Thanks,

Chris

execvp error


Gentlemen, could you please help with an issue?

I'm using the Intel compiler (ifort 18.0.2) and Intel MPI 2018.2.199 in an attempt to run the WRF model on an HPE (formerly SGI) ICE X machine.

wrfoperador@dpns31:~> ifort -v
ifort version 18.0.2

wrfoperador@dpns31:~> mpirun -V
Intel(R) MPI Library for Linux* OS, Version 2018 Update 2 Build 20180125 (id: 18157)
Copyright 2003-2018 Intel Corporation.
wrfoperador@dpns31:~>

 

When I run the executable I receive the following message:

/opt/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mpi/intel64/bin/mpirun r1i1n0 12 /home/wrfoperador/wrf/wrf_metarea5/WPS/geogrid.exe
[proxy:0:0@dpns31] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file r1i1n0 (No such file or directory)
(the same message is printed once for each of the 12 processes)
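
It looks to me as if mpirun is treating the host name as the executable. The form I plan to try next is roughly this (a sketch; the -hosts/-n spellings are from the Intel MPI reference, paths unchanged):

/opt/intel/intel_2018/compilers_and_libraries_2018.2.199/linux/mpi/intel64/bin/mpirun \
    -n 12 -hosts r1i1n0 /home/wrfoperador/wrf/wrf_metarea5/WPS/geogrid.exe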

Could you help me to solve this problem?

 

Thanks for your attention. I'm looking forward to your reply.

Help With Very Slow Intel MPI Performance


All,

I'm hoping the Intel MPI gurus can help with this. Recently I've tried transitioning some code I help maintain (GEOS, a climate model) from using HPE MPT (2.17, in this case) to Intel MPI (18.0.1; 18.0.2 I'll test soon).  In both cases, the compiler (Intel 18.0.1) is the same, both running on the same set of Haswell nodes on an SGI/HPE cluster. The only difference is the MPI stack.

Now one part of the code (AGCM, the physics/dynamics part) is actually a little bit faster with Intel MPI than MPT, even on an SGI machine. That is nice. It's maybe 5-10% faster in some cases. Huzzah!

But another code (GSI, analysis of observation data) really, really, really does not like Intel MPI. This code displays two issues. First, after the code starts (both launch very fast), it eventually hits the point at which, we believe, the first collective occurs, and the whole code stalls as it...initializes buffers? Something with InfiniBand, maybe? We don't know. MPT slows a bit there too, but doesn't show the issue nearly as badly as Intel MPI. We had another spot like this in the AGCM where moving from a collective to an Isend/Recv/Wait paradigm really helped. This "stall" is annoying and, worse, it gets longer and longer as the number of cores increases. (We might have a reproducer for this one.)

But, that is minor really. A minute or so, compared to the overall performance. On 240 cores, MPT 2.17 runs this code in 15:03 (minutes:seconds), Intel MPI 18.0.1, 28:12. On 672 cores, MPT 2.17 runs the code in 12:02 and Intel MPI 18.0.2 in 21:47; doesn't scale well overall for either.

Using I_MPI_STATS, the code is seen to spend ~60% of its MPI time in Alltoallv (20% of wall time) at 240 cores; at 672 cores, Barrier starts to dominate, but Alltoallv is still 40% of MPI time and 23% of wall time. I've tried both I_MPI_ADJUST_ALLTOALLV options (1 and 2) and they do little at all (28:44 and 28:25 at 240 cores).
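
For reference, the tuning tried so far amounts to nothing more than this (a sketch of the environment settings; the stats level and executable name are placeholders):

export I_MPI_STATS=10                 # collect the per-collective statistics quoted above
export I_MPI_ADJUST_ALLTOALLV=1       # also tried 2; neither changed the runtime noticeably
mpirun -np 240 ./gsi.x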

I'm going to try and see if I can request/reserve a set of nodes for a long time to do an mpitune run, but since each run is ~30 minutes...mpitune will not be fun as it'd be 90 minutes for each option test.

Any ideas on what might be happening? Any advice for flags/environment variables to try? I understand that HPE MPT might/should work best on an SGI/HPE machine (like how Intel compilers seem to do best with Intel chips), but this seems a bit beyond the usual difference. I've requested MVAPICH2 be installed as well for another comparison.

Matt

Problem with NFS Over RDMA on OmniPath


I have been trying to set up NFS over RDMA on Omni-Path, following the instructions in the official documentation. IPoIB works fine, but I cannot get NFS over RDMA working. I have modified /etc/rdma/rdma.conf and added:

NFSoRDMA_LOAD=yes
NFSoRDMA_PORT=2050

I have also loaded the appropriate modules (sunrpc on the client, xprtrdma on the server). However, the NFS mount fails with "connection refused" when mounting over RDMA; note that it works fine if I do not specify rdma.

It appears that the port 2050 listener for NFSoRDMA never gets created: when I run rpcinfo from the client to examine the server, I see port 2049 for nfs but nothing on 2050.
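
For reference, the client-side commands I am using look roughly like this (a sketch; the server name and export path are placeholders):

# Mount over RDMA on the port configured above (this is what fails with "connection refused"):
mount -t nfs -o rdma,port=2050 nfsserver:/export /mnt/rdma
# Check what the server has actually registered:
rpcinfo -p nfsserver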

This is on CentOS 7.4. Any ideas/suggestions what may be wrong?
