Channel: Clusters and HPC Technology

How to select TCP/IP as fabric at runtime with Intel MPI 2019


Hello,

My apologies: I posted this earlier within another thread, but afterwards decided to submit it as a new query.

I have been struggling for a couple of days to figure out a very basic setting: how to correctly set up Intel MPI 2019 for use over sockets/TCP. I am able to source mpivars.sh without any parameters and then export FI_PROVIDER=sockets, which allows me to compile and run the ubiquitous hello-world example on a single node with n ranks. However, when I set up my environment in the same way and try to compile PAPI from source, the configure step complains that the C compiler (GCC in this case) is not able to create executables. The config.log reveals that it cannot find libfabric.so.1. Even after adding the libfabric directory to my LD_LIBRARY_PATH and linking against libfabric, I am still not able to build PAPI from source. Additionally, I cannot find good documentation on how to use Intel MPI in the simplest, most basic way: a single node and several processes. There is a graphic in several presentations, and even on software.intel.com/intel-mpi-library, which indicates I should be able to choose TCP/IP, among other fabric options, at runtime. I would appreciate your comments and assistance on the correct way to do this.
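
For reference, a minimal single-node TCP/sockets setup of the kind described above usually looks like the sketch below. The install prefix, and whether the extra LD_LIBRARY_PATH line is needed at all, are assumptions based on a typical Intel MPI 2019 layout; adjust to your installation.

# Sketch: select the OFI "sockets" (TCP/IP) provider at runtime on one node.
source /opt/intel/compilers_and_libraries_2019/linux/mpi/intel64/bin/mpivars.sh
export FI_PROVIDER=sockets
# mpivars.sh normally puts the bundled libfabric.so.1 on the runtime path already;
# if a build still cannot find it, this is the directory to add:
export LD_LIBRARY_PATH=$I_MPI_ROOT/intel64/libfabric/lib:$LD_LIBRARY_PATH
mpicc hello.c -o hello        # mpiicc/mpiifort if the Intel compilers are used
mpirun -n 4 ./hello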

Regards,

-Rashawn


Intel mpirun error - AI workload


Hi,

  I tried to run one of my workload models for training on a CentOS cluster for MPI analysis. The commands used and the resulting error are shown below. I would appreciate your help in resolving the issue.

Commands used 

mpiexec -ppn 1 -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi

mpirun -ppn 1 -l amplxe-cl -collect hotspots -k sampling-mode=hw -result-dir results -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi

I keep getting the following error. 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 26 PID 72362 RUNNING AT node001
=   EXIT STATUS: 255
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 27 PID 72363 RUNNING AT node001
=   KILLED BY SIGNAL: 9 (Killed)
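
For reference, a common first step with a BAD TERMINATION like this is to raise the startup debug level and tag every output line with its rank, so the failing rank's own message becomes visible; since the 255 exit status is returned by the launched script itself, its output is usually where the real error is. A sketch reusing the command from above:

# Sketch: verbose startup info, rank-prefixed output, everything captured to a log.
export I_MPI_DEBUG=5
mpiexec -l -ppn 1 -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts \
    --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt \
    --network tcp --netmask enp175s0 --benchmark mpi 2>&1 | tee run.log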
 

I_MPI_STATS removed


Hi all,

 

I just learned that gathering statistics using I_MPI_STATS is not supported by version 2019 of Intel MPI (https://software.intel.com/en-us/articles/intel-mpi-library-2019-beta-re..., "Removals"). 

 

I found this feature quite useful. Is there now a different way to gather MPI statistics? Will this removal be permanent, or might the environment variable be re-introduced in a later version? If neither is the case, what was the motivation for removing it?
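
For reference, the removed 2018-style usage and the tool Intel now points to for similar data look roughly like this (the APS invocation is an assumption based on the Application Performance Snapshot documentation; check it against your installed version):

# Intel MPI 2018.x and earlier: built-in statistics via environment variables.
export I_MPI_STATS=ipm                 # or a native level such as I_MPI_STATS=10
export I_MPI_STATS_FILE=mpi_stats.txt
mpirun -n 16 ./app

# Intel MPI 2019: similar per-run MPI summaries via Application Performance Snapshot.
mpirun -n 16 aps ./app                 # writes an aps_result_* directory
aps --report=aps_result_*              # turn the result directory into a summary report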

 

Best,

Christian

AWS, Intel MPI and EFA


I'm having some issues with the latest Intel MPI and EFA on AWS instances.

I installed the Intel MPI from the install script found elsewhere in the support forums.

(https://software.intel.com/sites/default/files/managed/f4/92/install_imp...)

I grabbed the latest libfabric source and built that.

The instance already had AWS's libfabric from the EFA setup install, but it's not in PATH/LD_LIBRARY_PATH for these tests.

[sheistan@compute-041249 ~]$ which mpiexec
/opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec
[sheistan@compute-041249 ~]$ which fi_info
/nasa/libfabric/latest/bin/fi_info
[sheistan@compute-041249 ~]$ fi_info -p efa
provider: efa
    fabric: EFA-fe80::4c2:2aff:fec7:ce80
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::4c2:2aff:fec7:ce80
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::4c2:2aff:fec7:ce80
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
 

When running on a single node, life is good as expected:

[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2  ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1

[0] MPI startup(): libfabric provider: efa

compute-041249
compute-041249
  pi is approximately:  3.1415926769620652  Relative Error is:  -0.20387909E-05
 Integration Wall Time = 0.005503 Seconds on       2 Processors for n =  10000000
 

But when two nodes are involved it hangs, in this case in an mpi_barrier() call.

[sheistan@compute-041249 ~]$ I_MPI_DEBUG=1 mpiexec --hostfile $PBS_NODEFILE -np 2 -ppn 1 ./pi_efa
[0] MPI startup(): libfabric version: 1.9.0a1

[0] MPI startup(): libfabric provider: efa

compute-041116
compute-041249
^C[mpiexec@compute-041249] Sending Ctrl-C to processes as requested
[mpiexec@compute-041249] Press Ctrl-C again to force abort
forrtl: error (69): process interrupted (SIGINT)
Image              PC                Routine            Line        Source
pi_efa             0000000000404724  Unknown               Unknown  Unknown
libc-2.26.so       00007F80892447E0  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D5A007  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D5B170  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D0B38D  Unknown               Unknown  Unknown
libfabric.so.1.11  00007F8088D0A72E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A7CAC90  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A0EBF7B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A7F5B2F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A3F42C0  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A052F7C  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A054138  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A17F4E2  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00007F808A05446E  MPI_Barrier           Unknown  Unknown
libmpifort.so.12.  00007F808AF1573C  pmpi_barrier          Unknown  Unknown
pi_efa             0000000000402EF9  Unknown               Unknown  Unknown
pi_efa             0000000000402E52  Unknown               Unknown  Unknown
libc-2.26.so       00007F808923102A  __libc_start_main     Unknown  Unknown
pi_efa             0000000000402D6A  Unknown               Unknown  Unknown
 

Thoughts on something to try or something I missed?
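
For reference, the pieces that usually have to line up when pointing Intel MPI 2019 at an externally built libfabric are sketched below; the paths are placeholders taken from the output above, and the variable choices are things to verify rather than a known EFA fix.

# Sketch: prefer the external libfabric and make both layers report what they picked.
export I_MPI_OFI_LIBRARY_INTERNAL=0                    # do not use the libfabric bundled with Intel MPI
export LD_LIBRARY_PATH=/nasa/libfabric/latest/lib:$LD_LIBRARY_PATH
export FI_PROVIDER=efa                                 # pin the provider instead of relying on auto-selection
export I_MPI_DEBUG=5
export FI_LOG_LEVEL=warn                               # libfabric-side diagnostics
mpiexec --hostfile $PBS_NODEFILE -np 2 -ppn 1 ./pi_efa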

 

thanks

 

s

 

 

 

Installing parallel_studio_xe_2019_cluster_edition on a cluster for a non-root user with a student license


Dear Intel support team,

Please help for below points:

Package "parallel_studio_xe_2019_cluster_edition.tgz" has been downloaded from the Intel student's account of Intel website.  This package needs to be deployed at the HPC cluster of 21 compute nodes and 1 master nodes as a non-root user.  We like to install Intel packages at path /home/apps/intel so that all server can access this lustre file system from the compute nodes and master nodes. 

1. Do we just install package at master node with the standalone license will be sufficient to use intel compilers at compute and master nodes? 

2. Do we need an license manager for it? 

3. Is standalone license will work as floating license if we deploy license manager? if yes, single license means single user can run it at any given time right? 

4. How shall we install intel packages without super user account at RHEL Linux? .
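
For question 4, a non-root deployment is normally driven by the installer's silent mode; a rough sketch follows. The silent.cfg keys vary by version, so treat them as assumptions and start from the sample silent.cfg shipped inside the package.

# Sketch: unattended user-level install into the shared Lustre path.
tar xzf parallel_studio_xe_2019_cluster_edition.tgz
cd parallel_studio_xe_2019_cluster_edition
cat > my_silent.cfg <<'EOF'
ACCEPT_EULA=accept
PSET_INSTALL_DIR=/home/apps/intel
ACTIVATION_TYPE=license_file
ACTIVATION_LICENSE_FILE=/home/apps/intel/licenses/student.lic
EOF
./install.sh --silent my_silent.cfg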

Thank you

MPI hangs with multiple processes on same node - Intel AI DevCloud


Hello,

I am using the Intel AI DevCloud to run deep reinforcement learning training, using mpi4py so that several agents collect data at the same time.

In my framework, I run N jobs (on different nodes) with agents and another job with the optimization algorithm in Python.

When I run a single agent in each job, the application works correctly. However, when I try to run more than one agent in the same job (same node), the application hangs.

I do not think the problem is the application itself, because it works when there is a single agent per node. Additionally, the same application used to work with multiple agents per node last year, when the Intel AI DevCloud was running CentOS.

The application is in this code: https://github.com/alexandremuzio/baselines/blob/71977df495e7840179dd05e...
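
For reference, a quick way to check whether plain mpi4py traffic works with several ranks inside one job, independent of the RL framework, is a one-liner like the sketch below (run it inside the same queue environment; the rank count is arbitrary):

# Sketch: does a trivial collective already hang with several ranks on one node?
I_MPI_DEBUG=5 mpirun -n 4 python -c "
from mpi4py import MPI
comm = MPI.COMM_WORLD
total = comm.allreduce(comm.Get_rank())   # default op is SUM
print(comm.Get_rank(), 'of', comm.Get_size(), 'sum =', total)
"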

 

I am probably not presenting enough data about the problem, so if you need any further information about the MPI setup, I can post it here.

 

Bad file descriptor for 80+ ranks on single host


Hi,

I am unable to launch a simple MPI application using more than 80 processes on a single host using Intel MPI 2018 update 1 and PBS Pro as job scheduler.

The job is submitted with a script containing:

#PBS -l select=81:ncpus=1
mpiexec.hydra -n 81 -ppn 1 ./a.out

In the call to MPI_Init, the following error is raised on rank 80:

[cli_80]: write_line error; fd=255 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_80]: Unable to write to PMI_fd
[cli_80]: write_line error; fd=255 buf=:cmd=barrier_in
:
system msg for write_line failure : Bad file descriptor
[cli_80]: write_line error; fd=255 buf=:cmd=get_ranks2hosts
:
system msg for write_line failure : Bad file descriptor
[cli_80]: expecting cmd="put_ranks2hosts", got cmd=""
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
[cli_80]: write_line error; fd=255 buf=:cmd=abort exitcode=68204815
:
system msg for write_line failure : Bad file descriptor

I looked more closely into the issue by running the application through strace. The output for rank 80 shows that the process tries to use the bash-internal file descriptor 255:

<snip>
uname({sys="Linux", node="uvtk", ...})  = 0
sched_getaffinity(0, 128,  { 0, 0, 0, 0, 80000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }) = 128
write(255, "cmd=init pmi_version=1 pmi_subve"..., 40) = -1 EBADF (Bad file descriptor)
write(2, "[cli_80]: ", 10[cli_80]: )              = 10
write(2, "write_line error; fd=255 buf=:cm"..., 72write_line error; fd=255 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
) = 72
write(2, "system msg for write_line failur"..., 56system msg for write_line failure : Bad file descriptor
) = 56
write(2, "[cli_80]: ", 10[cli_80]: )              = 10
write(2, "Unable to write to PMI_fd\n", 26Unable to write to PMI_fd
) = 26
uname({sys="Linux", node="uvtk", ...})  = 0
write(255, "cmd=barrier_in\n", 15)      = -1 EBADF (Bad file descriptor)
write(2, "[cli_80]: ", 10[cli_80]: )              = 10
write(2, "write_line error; fd=255 buf=:cm"..., 47write_line error; fd=255 buf=:cmd=barrier_in
:
) = 47
write(2, "system msg for write_line failur"..., 56system msg for write_line failure : Bad file descriptor
) = 56
write(255, "cmd=get_ranks2hosts\n", 20) = -1 EBADF (Bad file descriptor)
write(2, "[cli_80]: ", 10[cli_80]: )              = 10
write(2, "write_line error; fd=255 buf=:cm"..., 52write_line error; fd=255 buf=:cmd=get_ranks2hosts
:
) = 52
write(2, "system msg for write_line failur"..., 56system msg for write_line failure : Bad file descriptor
) = 56
read(255, 0x7fffe00d6320, 1023)         = -1 EBADF (Bad file descriptor)
write(2, "[cli_80]: ", 10[cli_80]: )              = 10
write(2, "expecting cmd=\"put_ranks2hosts\","..., 44expecting cmd="put_ranks2hosts", got cmd=""
) = 44
write(2, "Fatal error in MPI_Init: Other M"..., 187Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2144)......: PMI_Init returned -1
) = 187
write(255, "cmd=abort exitcode=68204815\n", 28) = -1 EBADF (Bad file descriptor)
write(2, "[cli_80]: ", 10[cli_80]: )              = 10
write(2, "write_line error; fd=255 buf=:cm"..., 60write_line error; fd=255 buf=:cmd=abort exitcode=68204815
:
) = 60
write(2, "system msg for write_line failur"..., 56system msg for write_line failure : Bad file descriptor
) = 56
exit_group(68204815)                    = ?

All other ranks communicate with pmi_proxy via a valid file descriptor. For example:

<snip>
uname({sys="Linux", node="uvtk", ...})  = 0
sched_getaffinity(0, 128,  { 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }) = 128
write(16, "cmd=init pmi_version=1 pmi_subve"..., 40) = 40
read(16, "cmd=response_to_init pmi_version"..., 1023) = 57
write(16, "cmd=get_maxes\n", 14)        = 14
read(16, "cmd=maxes kvsname_max=256 keylen"..., 1023) = 56
uname({sys="Linux", node="uvtk", ...})  = 0
write(16, "cmd=barrier_in\n", 15)       = 15
read(16,  <unfinished ...>

Is it possible to specify a list of file descriptors available to the MPI processes, or is there any other way to circumvent this behavior?
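
For reference, a couple of quick checks help narrow down whether the descriptor handed to rank 80 collides with one the launching shell already owns (a diagnostic sketch only, not a known fix):

# Sketch: compare the per-process fd limit with what the shell and proxy already hold.
ulimit -n                                           # soft limit on open descriptors in the job environment
ls /proc/$$/fd                                      # descriptors held by the launching shell (255 is bash's own)
ls /proc/$(pgrep -f pmi_proxy | head -1)/fd | wc -l # descriptors already open in the pmi_proxy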

Regards,
Pieter
 

error LNK2001: unresolved external symbol


Hi, I am using Fortran 90 with MPI. I followed the steps on the website below to link MPI with Visual Studio 2017, but it didn't work. I get error LNK2001 for every MPI call, e.g. error LNK2001: unresolved external symbol _MPI_ALLREDUCE; error LNK2001: unresolved external symbol _MPI_BARRIER; etc. I cannot even build the test.f90 code.

https://software.intel.com/en-us/mpi-developer-guide-windows-configuring...

The attached pictures show my configuration. Is there anything wrong with it? Has anyone run into the same problem before? How can I solve it?

Thanks a lot!


Trouble installing Intel MPI in a docker container


I've been trying to get Intel MPI installed on top of a windowsservercore base image. The MPI installer fails during the studio_common_libs_p* sub-installation.

I've attached the logs, but the relevant error is:

WixCreateInternetShortcuts:  Creating IUniformResourceLocatorW shortcut 'C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Intel Parallel Studio XE 2018\Online Service Center.url' target 'https://supporttickets.intel.com'
WixCreateInternetShortcuts:  Error 0x80040154: failed to create an instance of IUniformResourceLocatorW
WixCreateInternetShortcuts:  Error 0x80040154: failed to create Internet shortcut
CustomAction WixCreateInternetShortcuts returned actual error code 1603 (note this may not be 100% accurate if translation happened inside sandbox)
 

This looks related to creating a URL shortcut, and searching around suggests it's because Explorer is not part of the image. I certainly don't need these shortcuts. All I need is a minimal installation so that I can use MPI from within the container. I don't need any development headers or libraries, etc.

Is there a workaround for this issue?

Jeff

MPI shared communication with sliced access


Hello

 

I have two questions regarding using mpi shared memory communication

 

 

1) If I have an MPI rank which is the only one that writes to a window, is it necessary to employ mpi_win_lock and mpi_win_unlock?
I know that my application will never have other ranks trying to write to that window. They only read the content of the window, and I make sure that they read after an MPI_BARRIER, so the content of the window has already been updated.

 

 

2) In my application I have one MPI rank, which allocates a shared window that needs to be read by 1:N other MPI ranks.

MPI rank 1 shall only read: rma(1:10) 

MPI rank 2 shall only read rma(11:20) 

MPI rank N shall only read  rma(10*(N-1)+1:10*N)

Currently, all N ranks query the whole shared window, i.e. the full size "10*N", with MPI_WIN_SHARED_QUERY.

 

I am asking whether it is possible to use MPI_WIN_SHARED_QUERY such that MPI rank 1 can only access the window elements 1:10, rank 2 only the elements 11:20, and so on.

In this way, each rank would use a local index range 1:10, but the ranges would refer to different chunks of the shared window. Is this possible?

 

 

Thanks very much!

Intel MPI 2019 with Mellanox ConnectX-5 / provider=ofi_rxm


Hi, we are currently standing up a new cluster with Mellanox ConnectX-5 adapters. I have found that using Open MPI, MVAPICH2, and Intel MPI 2018, we can run MPI jobs on all 960 cores in the cluster; however, using Intel MPI 2019 we can't get beyond ~300 MPI ranks. If we do, we get the following error for every rank: 

Abort(273768207) on node 650 (rank 650 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack: 
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=650, new_comm=0x7911e8) failed 
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIR_Allgather_intra_auto(145).........: Failure during collective 
MPIR_Allgather_intra_auto(141).........: 
MPIR_Allgather_intra_brucks(115).......: 
MPIC_Sendrecv(344).....................: 
MPID_Isend(662)........................: 
MPID_isend_unsafe(282).................: 
MPIDI_OFI_send_lightweight_request(106): 
(unknown)(): Other MPI error 
---------------------------------------------------------------------------------------------------------- 
This is using the default FI_PROVIDER of ofi_rxm. If we switch to using "verbs", we can run all 960 cores, but tests show an order of magnitude increase in latency and much longer run times. 

We have tried installing our own libfabric (from the git repo; we also verified with verbose debugging that we are using this libfabric), and this behavior does not change.

Is there anything I can change to allow all 960 cores using the default ofi_rxm provider?  Or, is there a way to improve performance using the verbs provider?

For completeness: 
Using MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.6-x86_64 ofed 
CentOS 7.6.1810 (kernel = 3.10.0-957.21.3.el7.x86_64) 
Intel Parallel studio version 19.0.4.243 
Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5] 
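
For reference, two knobs that come up when an OFI/rxm run stops scaling at a few hundred ranks are the explicit provider chain and libfabric's expected-peer-count hint; a sketch is below. Treat the values as assumptions to experiment with, not a verified fix for ConnectX-5.

# Sketch: pin the provider chain explicitly and raise the expected job size.
export FI_PROVIDER="verbs;ofi_rxm"    # RxM layered over verbs
export FI_UNIVERSE_SIZE=1024          # libfabric hint for the expected number of peers
export I_MPI_DEBUG=5                  # confirms which provider is actually used
export FI_LOG_LEVEL=warn
mpirun -n 960 ./app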

Thanks! 

Eric
 

Fortran+MPI program can't run on OPA when using variable format expressions

The code:
use mpi
implicit none
real*8 send_data(10)
integer ierr,mynode,numnodes,status(10)
integer k

send_data=0.0d0
call mpi_init(ierr)
call mpi_comm_rank(MPI_Comm_World, mynode, ierr)
call mpi_comm_size(MPI_Comm_World, numnodes, ierr)
k=10
if(mynode==0) write(*,100)send_data(:10)
100 format(<k>I4) !can't run with OPA Error
!100 format(10I4) !can run on OPA
call mpi_finalize(ierr)
end

Compiler: 2019.update2 and 2019.update4

Compile command: mpiifort  

OS: CentOS Linux release 7.6.1810

opainfo(10.7.0.0.133) output: 

hfi1_0:1                           PortGID:0xfe80000000000000:00117509010ad11e
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb        
   LinkWidth      Act: 4            En: 4           
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4         
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True 
   LID: 0x00000001-0x00000001       SM LID: 0x00000001 SL: 0 
         QSFP Copper,       3m  Hitachi Metals    P/N IQSFP26C-30       Rev 01
   Xmit Data:               1424 MB Pkts:             10048182
   Recv Data:               7270 MB Pkts:             10871417
   Link Quality: 5 (Excellent)

Run: $ mpirun -n 4 -env I_MPI_FABRICS shm:ofi ./a.out

Output:

geospace.430313hfi_userinit: mmap of status page (dabbad0008030000) failed: Operation not permitted
geospace.430313psmi_context_open: hfi_userinit: failed, trying again (1/3)
geospace.430313hfi_userinit: assign_context command failed: Invalid argument
geospace.430313psmi_context_open: hfi_userinit: failed, trying again (2/3)
geospace.430313hfi_userinit: assign_context command failed: Invalid argument
geospace.430313psmi_context_open: hfi_userinit: failed, trying again (3/3)
geospace.430313hfi_userinit: assign_context command failed: Invalid argument
geospace.430313PSM2 can't open hfi unit: -1 (err=23)
Abort(1619087) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(666)..........: 
MPID_Init(922).................: 
MPIDI_NM_mpi_init_hook(1070)...: 
MPIDI_OFI_create_endpoint(1751): OFI endpoint open failed (ofi_init.h:1751:MPIDI_OFI_create_endpoint:Invalid argument)

But if the format is changed from "100 format(<k>I4)" to "100 format(10I4)", it runs fine.

Thank you.

using MPI_MODE_EXCL in MPI_File_open


When opening a file with MPI_File_open using MPI_MODE_EXCL, an error should be returned if the file being created already exists. But is the file eventually opened anyway, and a valid file handle returned?

thanks

 

 

Floating point exception mpiexec.hydra "$@" 0


Hi everyone,

I am struggling to get Intel® Parallel Studio XE 2019 to run this simple hello_mpi.f90:

program hello_mpi
   implicit none
   include 'mpif.h'
   integer  ::  rank, size, ierror, tag
   integer  ::  status(MPI_STATUS_SIZE)

   call MPI_INIT(ierror)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
   print*, 'node', rank, ': Hello world'
   call MPI_FINALIZE(ierror)
end program hello_mpi

I first tried Update 3 and now Update 4, and although the code compiles, I get this error when trying to run it:

mpirun -np 4 ./hello_mpi
.../intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpirun: line 103: 13574 Floating point exceptionmpiexec.hydra "$@" 0<&0

Here is what gdb --args mpiexec.hydra -n 2 hostname returns:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/uio/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec.hydra...done.
(gdb) run
Starting program: .../intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec.hydra -n 2 hostname
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGFPE, Arithmetic exception.
IPL_MAX_CORE_per_package () at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:336
336     ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install intel-mpi-rt-2019.4-243-2019.4-243.x86_64
 

Can someone help?
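
Since the crash is inside hydra's processor-topology detection (IPL_MAX_CORE_per_package), one workaround sometimes suggested is to switch the topology library hydra uses; a sketch is below. I_MPI_HYDRA_TOPOLIB and its values are taken from the Intel MPI 2019 documentation, but treat this as something to try rather than a confirmed fix.

# Sketch: let hydra use hwloc for topology detection instead of its internal ipl code.
export I_MPI_HYDRA_TOPOLIB=hwloc
mpirun -np 4 ./hello_mpi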

Thanks in advance,

Jean

Floating Point Exception Overflow and OpenMPI equivalent tuning?


I am working with star ccm+ 2019.1.1 Build 14.02.012
CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is version shipped with star ccm+)
Cisco UCS cluster using USNIC fabric over 10gbe
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 cores

enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed

On runs of less than 5 hours, everything works flawlessly and is quite fast.

However, when running with 280 cores, the longer jobs die with a floating point exception at or around 5 hours into the job.
The same job completes fine with 140 cores, but takes about 14 hours to finish.
Also, I am using PBS Pro with a 99-hour wall time.

------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow].  The specific cause cannot be identified.  Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
   error: Server Error
------------------

I have been doing some reading, and some say that other MPI implementations are more stable with Star-CCM+.

I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.

I am also trying to make Open MPI work. I have Open MPI compiled and it runs, however only with a very small number of CPUs. With anything over about 2 cores per node it hangs indefinitely.

I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because that is what the Star-CCM+ version I am running supports. I am telling Star-CCM+ to use the Open MPI that I installed so it can support the Cisco usNIC fabric, which I can verify using Cisco native tools. Note that Star-CCM+ also ships with its own Open MPI, however.

I am thinking that I need to tune Open MPI, which was also required with Intel MPI.

With Intel MPI, jobs with more than about 100 cores would hang until I added these parameters:

reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-a...

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647

After adding these parameters I can scale to 280 cores and it runs very fast, up until the point where it gets the floating point exception.

I am banging my head against a wall trying to find equivalent tuning parameters for Open MPI.

I have listed all the MCA parameters available with Open MPI, and have tried setting the following with no success (the sketch after the list shows how they are typically applied).

btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
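
For reference, these MCA settings would typically be applied either per run on the mpirun command line or once in an MCA parameter file; a sketch of both forms is below (the btl list and values are simply the ones from this post, not a recommendation):

# Sketch: per-run MCA parameters on the mpirun command line...
mpirun --mca btl usnic,vader,self \
       --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 \
       -np 280 ./your_app                  # placeholder launch line

# ...or persistently in $HOME/.openmpi/mca-params.conf, so that any mpirun
# (including one started by Star-CCM+) picks them up:
#   btl = usnic,vader,self
#   btl_usnic_sd_num = 8208
#   btl_usnic_rd_num = 8208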

Does anyone have any advice or ideas for:

1.) the floating point overflow issue, and

2.) equivalent tuning parameters for Open MPI?

Many thanks in advance


Intel MPI questions


So I've been trying to unconfuse myself about the various fabrics/transports supported by Intel MPI 2018/2019 as used for the `I_MPI_FABRICS` variable and related ones. I have a few questions I'm hoping someone can help with - they're all related, so I have put them in one thread:

1. Is there a way of getting Intel MPI to output which transports/fabrics it thinks are available on a machine? Or is it just a question of trying each one with fallback disabled?

2. 2018 has a `tcp` fabric while 2019's `OFI` fabric has a `TCP` provider. Am I right in thinking these are *not* the same, with the former not using `libfabric` at all?

3. 2019's `OFI` fabric (2019 only?) has an `RxM` provider. The other OFI providers seem tied to specific hardware, but I'm not clear whether this one is, or whether it is something more fundamental?

4. 2018 had an `ofa` fabric which says it supports InfiniBand through OFED Verbs:

a) Am I right in thinking this is *not* the same verbs interface as is provided by the `ofi` fabric's `verbs` provider?

b) Does the OFA/OFED Verbs interface support anything other than InfiniBand?
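
On question 1, for the 2019/OFI side there is at least a low-level way to see what libfabric itself detects, plus the startup debug print; a sketch (assuming fi_info, either the copy bundled with Intel MPI 2019 or any libfabric install, is on PATH):

# Sketch: list the libfabric providers visible on this machine...
fi_info -l
# ...and let Intel MPI report which provider it actually selected at startup.
I_MPI_DEBUG=5 mpirun -n 2 ./a.out | head -20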

Many thanks for any answers!

ERROR: ld.so: object ''libVT.so' from LD_PRELOAD cannot be preloaded: ignored.


Dear all,

I am using Intel Parallel Studio XE 2017.6 in order to trace a hybrid OpenMP/MPI application.

I use:
```mpiexec.hydra -trace "libVT.so libmpi.so" python ....py args```
and although the application runs fine and an .stf file is created with reasonable results
the log file of my application's execution gives me the error:

ERROR: ld.so: object ''libVT.so' from LD_PRELOAD cannot be preloaded: ignored.

I would expect this error to be resolved by using :
export LD_PRELOAD=.../libVT.so

however it still persists.

If I remove "libVT.so libmpi.so" from the command above, I get:

ERROR: ld.so: object ''libVT.so' from LD_PRELOAD cannot be preloaded: ignored.
python: symbol lookup error: /rdsgpfs/general/apps/intel/2017.6/itac/2017.4.034/intel64/slib/libVT.so: undefined symbol: PMPI_Initialized

and my application terminates without success.
Does that mean that even though it complains about the failed preload, it still uses the library? (I guess yes.)
Should I trust the results I get from the run that is `successful` but still `complaining`?
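
For reference, the preload usually has to be given as absolute paths so the dynamic loader can resolve it on every node, and libVT.so needs libmpi.so preloaded alongside it (that is what the undefined PMPI_Initialized symbol points to). A sketch using the paths visible in this post; the itacvars.sh location and the libmpi.so path are assumptions about a standard ITAC/Intel MPI 2017 layout:

# Sketch: source the ITAC environment, then preload both libraries by absolute path.
source /rdsgpfs/general/apps/intel/2017.6/itac/2017.4.034/bin/itacvars.sh
export LD_PRELOAD=/rdsgpfs/general/apps/intel/2017.6/itac/2017.4.034/intel64/slib/libVT.so:$I_MPI_ROOT/intel64/lib/libmpi.so
mpiexec.hydra -n 4 python my_script.py args    # my_script.py stands in for the real script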

I will be more than happy to help with more info if needed.

Thank you in advance,
George Bisbas

 

Performance difference running IMB-MPI1 with mpirun -PSM2 and -OFI


Hi, I found that the performance difference is significant when running

I_MPI_DEBUG=1 mpirun -PSM2 -host node1 -n 1 ./IMB-MPI1 Sendrecv : -host node2 -n 1 ./IMB-MPI1
......
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
......
      4194304           10       364.40       364.52       364.46     23012.87

and

I_MPI_DEBUG=1 mpirun -OFI -host node1 -n 1 ./IMB-MPI1 Sendrecv : -host node2 -n 1 ./IMB-MPI1
......
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
......
      4194304           10       487.40       487.80       487.60     17196.66

 Output of the latter seems to indicate that it uses psm2 backend too.

[0] MPID_nem_ofi_init(): used OFI provider: psm2
[0] MPID_nem_ofi_init(): max_buffered_send 64
[0] MPID_nem_ofi_init(): max_msg_size 64
[0] MPID_nem_ofi_init(): rcd switchover 32768
[0] MPID_nem_ofi_init(): cq entries count 8
[0] MPID_nem_ofi_init(): MPID_REQUEST_PREALLOC 128
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part    
#------------------------------------------------------------
# Date                  : Tue Sep 10 17:36:59 2019
# Machine               : x86_64
# System                : Linux
# Release               : 4.4.175-89-default
# Version               : #1 SMP Thu Feb 21 16:05:09 UTC 2019 (585633c)
# MPI Version           : 3.1
# MPI Thread Environment: 
.......

It gets even weirder when I run with I_MPI_FABRICS set: I get only an error.

I_MPI_DEBUG=1 I_MPI_FABRICS=shm,psm2 mpirun -host node81 -n 1 ./IMB-MPI1 Sendrecv : -host node82 -n 1 ./IMB-MPI1
[1] MPI startup: syntax error in intranode path of I_MPI_FABRICS = shm,psm2 and fallback is disabled, allowed value(s) shm,ofi,tmi,dapl,ofa,tcp
[0] MPI startup: syntax error in intranode path of I_MPI_FABRICS = shm,psm2 and fallback is disabled, allowed value(s) shm,ofi,tmi,dapl,ofa,tcp

Is the performance difference expected? If so, can I make mpirun default to -PSM2 by changing the environment or configuration (other than aliasing mpirun to "mpirun -PSM2", of course)?
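
For reference, with the 2018-style fabric list shown in that error message, PSM2 is normally reached through the tmi fabric rather than named directly; a sketch based on the Intel MPI 2018 variable set:

# Sketch: shared memory intra-node, TMI/PSM2 inter-node (Intel MPI 2018 syntax).
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm2
I_MPI_DEBUG=1 mpirun -host node1 -n 1 ./IMB-MPI1 Sendrecv : -host node2 -n 1 ./IMB-MPI1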

support for Fortran USE MPI_F08 on Windows OS


Does Intel 19 cluster edition for Windows support the use of USE MPI_F08 in Fortran applications? When I try to compile code with this module on Windows with Intel 19.4 compilers, I get the following error:

error #7002: Error in opening the compiled module file.  Check INCLUDE paths.   [MPI_F08]
        use mpi_f08
------------^
compilation aborted

I have,

mpifc.bat for the Intel(R) MPI Library 2019 Update 4 for Windows*
Copyright 2007-2019 Intel Corporation.

IMB-MPI1 PingPong test failed with 4M message size


Hi All, 

I have two Dell R815 servers, each with 4 AMD Opteron 6380 processors (16 cores each), connected directly by two InfiniBand cards. I have trouble running the IMB-MPI1 test even on a single node:

mpirun -n 2 -genv I_MPI_DEBUG=3 -genv I_MPI_FABRICS=ofi  /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1

The run aborted with the following error:

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         2.23         0.00
            1         1000         2.24         0.45
            2         1000         2.25         0.89
            4         1000         2.26         1.77
            8         1000         2.24         3.57
           16         1000         2.25         7.12
           32         1000         2.27        14.08
           64         1000         2.43        26.33
          128         1000         2.55        50.26
          256         1000         3.60        71.08
          512         1000         4.12       124.40
         1024         1000         5.04       203.00
         2048         1000         6.89       297.38
         4096         1000        10.56       387.76
         8192         1000        13.98       585.83
        16384         1000        22.74       720.65
        32768         1000        30.12      1087.81
        65536          640        46.17      1419.45
       131072          320        76.43      1714.87
       262144          160       334.23       784.32
       524288           80       511.22      1025.57
      1048576           40       850.76      1232.51
      2097152           20      1518.37      1381.19
Abort(941742351) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(155)............: MPI_Send(buf=0x3a100f0, count=4194304, MPI_BYTE, dest=1, tag=1, comm=0x84000003) failed
MPID_Send(572)............:
MPIDI_send_unsafe(203)....:
MPIDI_OFI_send_normal(414):
(unknown)(): Other MPI error

However, it runs fine with shm:

mpirun -n 2 -genv I_MPI_DEBUG=3 -genv I_MPI_FABRICS=shm  /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1

Trying to run with 2 CPUs on two different nodes also fails at the 4M message size.

I have been struggling with this for a few days now without success. Any suggestions on where to look or what to try?
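
For reference, one way to narrow this down is to raise the debug levels and pin the libfabric provider explicitly, so the 4M failure can be tied to a specific provider rather than to OFI as a whole (a diagnostic sketch; the providers worth trying depend on what fi_info reports on these nodes):

# Sketch: confirm which provider is selected, then force candidates one at a time.
fi_info -l                                   # providers libfabric sees on this host
export I_MPI_DEBUG=5
export FI_LOG_LEVEL=warn
FI_PROVIDER=verbs mpirun -n 2 /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1 PingPong
# then repeat with the other providers listed by fi_info (e.g. tcp or sockets)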

Thanks!

Qi
