Channel: Clusters and HPC Technology

Severe Memory Leak with 2019 Intel MPI


Both 2019 Intel MPI releases have a severe memory leak that goes away when I revert to the 2015 version (i.e. source /opt/intel/comp2015/impi/5.0.2.044/intel64/bin/mpivars.sh). I am attaching two valgrind outputs, lapw1.vg.285276 from 2019 Intel MPI and lapw1.vg.5451 from 2015 Intel MPI, which show it quite clearly.

For reference, the entries in the valgrind logs with "init_parallel_ (in /opt/Wien2k_18.1F/lapw1Q_mpi)" are the MPI initialization, so these are almost certainly not real leaks, as the initialization is done only once. The entries associated with the ScaLAPACK pdsygst call are probably the culprit.

If needed I can provide a package to reproduce this. It is part of a large code, so decomposing into a small test code is not feasible.
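
For anyone wanting to collect similar data, per-rank valgrind logs can be gathered with a command along these lines (a hedged example of the general pattern, not necessarily my exact invocation; %p expands to the process ID):

mpirun -n <nproc> valgrind --leak-check=full --log-file=lapw1.vg.%p /opt/Wien2k_18.1F/lapw1Q_mpi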


The parameter localroot is not recognized at start of run


Hi,

I use MPI to parallelize parts of my QuickWin project, built with the Fortran 2019 Cluster Edition. Earlier I was helped by Intel to manage my QuickWin graphics output by using the parameter localroot. This works fine with the 2017 Update 4 version. When I go to the 2019 version, localroot is not recognized. No change has been introduced in the command file starting the execution.

The error report, which shows the command file, is attached.

Best regards

Anders S

-perhost parameter forgotten after first iteration over all hosts


Dear developers,

the round-robin placement forgets the -perhost parameter once it has iterated over all hosts in the hostfile.
This was tested with Intel MPI 2019.1.

My hostfile looks like:

node551
node552

And when I start a small job, I get:

I_MPI_DEBUG=4 I_MPI_PIN_DOMAIN=core mpirun -f hostfile -n 8 -perhost 2  ./a.out
[0] MPI startup(): libfabric version: 1.7.0a1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       377136   node551   {0,40}
[0] MPI startup(): 1       377137   node551   {1,41}
[0] MPI startup(): 2       151304   node552   {0,40}
[0] MPI startup(): 3       151305   node552   {1,41}
[0] MPI startup(): 4       377138   node551   {2,42}
[0] MPI startup(): 5       151306   node552   {2,42}
[0] MPI startup(): 6       377139   node551   {3,43}
[0] MPI startup(): 7       151307   node552   {3,43}

Ranks 0-3 are distributed as expected, but ranks 4-7 are spread across the hosts as if the -perhost parameter had been reset to 1.
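
For reference, a.out here is just a trivial placement checker; a program of roughly the following shape is enough to reproduce the observation (a minimal sketch, not my exact source, since the pinning report above comes from I_MPI_DEBUG anyway):

#include <stdio.h>
#include "mpi.h"

/* Minimal placement checker: each rank reports the node it landed on. */
int main(int argc, char *argv[]) {
   int rank, size, namelen;
   char name[MPI_MAX_PROCESSOR_NAME];

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Get_processor_name(name, &namelen);
   printf("rank %d of %d on %s\n", rank, size, name);
   MPI_Finalize();
   return 0;
}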

Intel MPI not running on Windows 10 Dell 7920 workstation


I have to run a parallel job; after installing Intel MPI, the software package is usually used to run parallel processing jobs. I don't know much about configuring Intel MPI and need help troubleshooting and setting it up.

The error appears when the program starts to launch MPI communication.

pgCC binder for MPI


Hi

I'm trying to compile the binding libraries for the PGI C++ compiler. In the readme the following is stated:

II.2.2. C++ Binding

To create the Intel(R) MPI Library C++ binding library using the
PGI* C++ compiler, do the following steps:

1. Make sure that the PGI* C++ compiler (pgCC) is in your PATH.

2. Go to the directory cxx

3. Run the command

   # make MPI_INST=<MPI_path> CXX=<C++_compiler> NAME=<name> \
     [ARCH=<arch>] [MIC=<mic option>]

   with

   <MPI_path>        - installation directory of the Intel(R) MPI Library
   <C++_compiler>    - compiler to be used
   <name>            - base name for the libraries and compiler script
   <arch>            - set `intel64` or `mic` architecture, `intel64` is used by
                       default
   <mic option>      - compiler option to generate code for Intel(R) MIC
                        Architecture. Available only when ARCH=mic is set, `-mmic`
                       is used by default in such case

4. Copy the resulting <arch> directory to the Intel(R) MPI Library installation
   directory.

I am trying to compile with the following command:

make MPI_INST=/prog/Intel/studioxe2016/compilers_and_libraries_2016.3.210/linux/mpi CXX=pgCC NAME=pgCC

which gives this output:

pgCC  -c -fpic -I/prog/Intel/studioxe2016/compilers_and_libraries_2016.3.210/linux/mpi/intel64/include -Iinclude -Iinclude/intel64 -o initcxx.o initcxx.cxx
"include/intel64/mpichconf.h", line 1362: catastrophic error: cannot open
          source file "nopackage.h"
  #include "nopackage.h"
                        ^

1 catastrophic error detected in the compilation of "initcxx.cxx".
Compilation terminated.
make: *** [initcxx.o] Error 2

Does anybody have an idea where I can get this nopackage.h, or why this error occurs?

I have successfully compiled the bindings for both pgcc and pgf90 without any issues.

How can I CreateProcessAsUser via hydra_service?


Hi,

I create a windows_shared_memory in a user application and open it in another process that is launched by mpiexec.exe.

The problem is that the MPI process cannot open the windows_shared_memory (ERROR_FILE_NOT_FOUND).

I think the reason is that the user application runs under session 1, while the MPI process runs under session 0 (because hydra_service runs under session 0).
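
If the session mismatch really is the cause, my understanding is that named kernel objects can be made visible across sessions by creating them under the Global namespace (this normally requires the SeCreateGlobalPrivilege). The following is only a hedged Win32 sketch of that idea with a made-up object name, not my actual code:

#include <windows.h>
#include <stdio.h>

int main(void) {
   /* Creating the mapping under the "Global\" prefix makes the name
      visible from other sessions (e.g. session 0, where hydra_service
      lives), provided the creating account has SeCreateGlobalPrivilege.
      "Global\\MySharedMem" is a made-up name used for illustration. */
   HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                 0, 4096, "Global\\MySharedMem");
   if (h == NULL) {
      printf("CreateFileMapping failed: %lu\n", GetLastError());
      return 1;
   }
   void *p = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 4096);
   if (p != NULL) {
      printf("mapped at %p\n", p);
      UnmapViewOfFile(p);
   }
   /* The MPI-launched process would then call
      OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, "Global\\MySharedMem")
      and MapViewOfFile on the returned handle. */
   CloseHandle(h);
   return 0;
}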

What can I do to make this work?

Best wishes.

MPI and Quantum Espresso


Dear experts,

I am having difficulty using MPI from Parallel Studio Cluster Edition 2016 in conjunction with Quantum ESPRESSO PWSCF v6.3.

I think the problems may be interrelated and have to do with MPI communicators. I compiled pw.x with the Intel compilers, Intel MPI, Intel ScaLAPACK and MKL, but without OpenMP.

I have been running pw.x with multiple processes quite successfully. However, when the number of processes is high enough that the space group has more than 7 processes, so that the subspace diagonalization no longer uses a serial algorithm, the program crashes abruptly at about the 10th iteration with the following errors:

Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffefaee7ce8, comm_new=0x7ffefaee7c40) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:

On the PW forum, I got this response:

'A careful look at the error message reveals that you are running out of space for MPI communicators, for which a fixed maximum number (16k) seems to be allowed. This hints at a problem somewhere: communicators are generated with MPI_Comm_split() and not properly cleared afterwards.'

But I don't know how to fix this.
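
If I understand the reply correctly, the pattern that avoids the problem is to free every derived communicator once it is no longer needed, roughly like this (a generic sketch, not the actual PWSCF source):

#include "mpi.h"

/* Derived communicators (from MPI_Cart_sub, MPI_Comm_split, ...) count
   against a fixed per-process limit and must be freed after use. */
void diagonalization_step(MPI_Comm cart_comm) {
   MPI_Comm row_comm;
   int remain_dims[2] = {0, 1};   /* keep only the second direction */

   MPI_Cart_sub(cart_comm, remain_dims, &row_comm);
   /* ... use row_comm for the subspace diagonalization ... */
   MPI_Comm_free(&row_comm);      /* without this, every iteration leaks
                                     one communicator until the ~16K
                                     per-process limit is reached */
}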

Please kindly advise,

Many thanks

Alex Durie

PhD student

IRECV/SSEND crashes for Intel MPI Library 2019


Hi,

I noticed that one of our MPI codes began crashing after installing Intel Parallel Studio XE 2019 (Intel MPI Library 2019 Update 1) on Windows. I tracked the issue down to a combination of SSEND/IRECV when the transferred data reaches a certain size. Test code exhibiting the crash is attached. The code does not crash when using Intel Parallel Studio XE 2018 (Intel MPI Library 2018 Update 3).

In particular, the 2019 library exhibits a crash when the double precision (square) matrix being transferred has a dimension of around 360-365, i.e. in the vicinity of 135K total elements. The crash occurs for both the 4-byte and 8-byte MPI interfaces. My compile and launch commands are

mpiifort -fpp -DMPI_MPI_INTEGER_TYPE=4 -DMPI_SYS_INTEGER_TYPE=4 test.F90
mpiexec -n 2 ./test.exe

for the 4-byte interface and

mpiifort -ilp64 -i8 -fpp -DMPI_MPI_INTEGER_TYPE=8 -DMPI_SYS_INTEGER_TYPE=8 test.F90
mpiexec -n 2 ./test.exe

for the 8-byte interface. 
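
For reference, the communication pattern in the attached test is essentially the following (a rough C rendering of the same SSEND/IRECV handshake; the attached test.F90 is the authoritative reproducer, and N = 365 is just one value in the failing range):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 365   /* one dimension in the failing range (~135K elements) */

int main(int argc, char *argv[]) {
   int rank;
   double *a = calloc((size_t)N * N, sizeof(double));
   MPI_Request req;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 0) {
      /* the sender uses a synchronous send ... */
      MPI_Ssend(a, N * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
   } else if (rank == 1) {
      /* ... and the receiver posts a non-blocking receive */
      MPI_Irecv(a, N * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
      printf("received %d x %d matrix\n", N, N);
   }
   MPI_Finalize();
   free(a);
   return 0;
}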

 

Any help or suggested workaround is much appreciated.

 

Thanks,

John

 

Attachment: test.F90 (3.29 KB)

mpiexec -hosts differences in MPICH and Intel MPI


What are the differences in how MPICH and Intel MPI handle mpiexec -hosts? It seems that Intel MPI doesn't recognize the <host>:<number of processes> syntax.

dapl async_event QP


Hello

I am facing the following errors with intel/2018.2 and intelmpi/2018.2, using mpiexec to submit my cluster simulations.

dapl async_event CQ (0x1750ff0) ERR 0
dapl_evd_cq_async_error_callback (0x169ada0, 0x16cf460, 0x2ab4fecb9d30, 0x1750ff0)
dapl async_event QP (0x1fdacc0) Event 1

After this point my runs terminate. Any assistance with resolving this error would be much appreciated.

Alexandra

Execution error using the Educator Intel Parallel Studio XE Cluster Development tools for Linux UBUNTU 18.04


I just installed the Educator Intel Parallel Studio XE Cluster Development tools for Linux on Ubuntu 18.04. It compiles and runs C, C++ and Fortran files fine, but when I use the Intel MPI library I get the error free(): invalid next size (fast). Most documented occurrences of this error are due to memory allocation bugs, but that can't be the case here since I don't explicitly allocate memory. The source code I tested with is the MPI hello world program:

#include <stdio.h>
#include <stdlib.h>

#include "mpi.h"

int main( int argc, char *argv[]) {
   int rank;

   MPI_Init( &argc, &argv);
   MPI_Comm_rank( MPI_COMM_WORLD, &rank);

   printf("rank:%d Hello World.\n", rank);
   MPI_Finalize();
   return 0;
}

It returns exit code 134 with the error "free(): invalid next size (fast)\nAborted (core dumped)". Using strace as follows:

strace mpirun -n 1 ./hello_mpi

I get the following error while it tries to start mpiexec.hydra:

stat("/opt/intel//compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpiexec.hydra", {st_mode=S_IFREG|0755, st_size=1887795, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f6851924a10) = 7499
wait4(-1, free(): invalid next size (fast)
[{WIFSIGNALED(s) && WTERMSIG(s) == SIGABRT && WCOREDUMP(s)}], 0, NULL) = 7499
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=7499, si_uid=1000, si_status=SIGABRT, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = 7499
write(2, "Aborted (core dumped)\n", 22Aborted (core dumped)
) = 22
pipe([3, 4])                            = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f6851924a10) = 7504

Running ldd on the executable I get

ldd hello_mpi
    linux-vdso.so.1 (0x00007ffdffdfe000)
    libmpi.so.12 => /opt/intel//compilers_and_libraries_2019.1.144/linux/mpi/intel64/lib/release/libmpi.so.12 (0x00007fbcc44d9000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbcc40e8000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbcc3ee0000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbcc3cdc000)
    libfabric.so.1 => /opt/intel//compilers_and_libraries_2019.1.144/linux/mpi/intel64/libfabric/lib/libfabric.so.1 (0x00007fbcc3aa3000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fbcc388b000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fbcc7968000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbcc366c000)

My PATH looks good

which mpicc
/opt/intel//compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpicc

which mpirun
/opt/intel//compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpirun

I tried installing the suite both with and without IA-32 support but I get the same error regardless.

I have run this with a source-compiled MPICH 3.3 (using the GNU 8.2.0 compilers) and it works great, so I think there is some issue in your Hydra implementation.

Thanks for any support.

 

--Mike

How to add a new node to an installed Intel Parallel Studio cluster


Hello, I am running Intel parallel_studio_xe_2019_update1_cluster_edition on Linux with my student license, and I have finished a cluster installation with a specific nodes file. Now my cluster is running, and I need to add a node to it without affecting the running nodes (I mean no job paused). Can you help me with that?

Can I install and make parallel studio xe cluster edition available on a cluster?


Hi,

I am using a free student version on an HPC Linux cluster at an academic institute. Now other users from this institute also want to use the Parallel Studio XE package for their academic research.

I see the license price at https://softwarestore.intel.com/SuiteSelection/ParallelStudio, but we use Parallel Studio (compilers and libraries) only to study the software and do research work; there is no need for any support. In this case, can I install the package and make it available to other users on the cluster?

Thanks in advance.

 

Best regards,

Dr. Hong Li

Integration problem between Torque 4 and Intel(R) MPI Library for Linux* OS, Version 2019 Update 1


Hi!

I have successfully compiled and linked a program with Intel MPI. If I run it interactively or in the background, it runs very fast and without any problems on our new server (ProLiant DL580 Gen10, 1 node with 4 processors of 18 cores each, 72 cores total, hyperthreading disabled). If I try to submit it via Torque (version 4), strange things happen, for example:

1) if I submit 2 jobs asking for 8 cores each, they are both fine

2) if I submit a third job (8 cores), it is 4 times slower because its 8 processes run on only two cores!

3) if I submit a fourth job, it runs properly, but if I qdel all four jobs, all of them disappear from qstat -a yet the fourth keeps running!

From previous discussions I noticed in this forum, I have the feeling it is an integration problem between Intel MPI and Torque, so I did the following:

 export I_MPI_PIN=off
 export I_MPI_PIN_DOMAIN=socket

To run the program, I used the following mpirun call:

/opt/intel/compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpirun -d -rmk pbs -bootstrap pbsdsh .................

I have checked and PBS_ENVIRONMENT is properly set to PBS_BATCH

Also, the Torque configuration is apparently correct; the file

/var/lib/torque/server_priv/nodes contains the following line:

dscfbeta1.units.it np=72 num_node_boards=1

This is a severe problem for me: the machine is shared, so we do need a scheduler like Torque (PBS) to run jobs compiled and linked against Intel MPI. Any help or suggestion is welcome!

thank you in advance

Mauro

Conda impi_rt=2019.1 doesn't substitute I_MPI_ROOT in bin/mpivars.sh


I am not sure where to report this bug, but it forces me to stick with intelpython 2018.0.3. The steps to reproduce are:

conda config --add channels intel
conda create -n test impi_rt=2019.1

You will find that I_MPI_ROOT is not substituted correctly in /path/to/envs/test/bin/mpivars.sh.

Or is conda no longer the supported way to install the Intel Performance Libraries? If so, what is the most future-proof way? Or, if it is still the best way, where should I report this bug? Thanks.


Integer overflow for MPI_COMM_WORLD ref-counting in MPI_Iprobe


Calling MPI_Iprobe 2^31 times results in the following error:

Abort(201962501) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe: Invalid communicator, error stack:
PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, flag=0x7ffd925056c0, status=0x7ffd92505694) failed
PMPI_Iprobe(90).: Invalid communicator

On our system, it takes about 10 minutes to perform this number of calls in a loop.
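
The reproducer is simply a tight loop of this shape (a minimal sketch):

#include <stdio.h>
#include "mpi.h"

/* Repeated MPI_Iprobe calls should be neutral with respect to the
   reference count of MPI_COMM_WORLD; after 2^31 calls the communicator
   is instead reported as invalid. */
int main(int argc, char *argv[]) {
   int flag;
   long long i;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   for (i = 0; i < (1LL << 31); i++) {
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
   }
   MPI_Finalize();
   printf("done\n");
   return 0;
}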

The affected version is Intel MPI 2019.1.144 (based on MPICH 3.3).
 

The expected behavior is that MPI_Iprobe is neutral with respect to the reference count of the provided communicator. Especially for MPI_COMM_WORLD, the reference count is superfluous.

Bad Termination Error Exit Code 4


Hi,

I have a binary that was compiled on Haswell using Intel 16.0 and Intel MPI 5.1.1. It runs fine on Haswell, but when I try to run it on Skylake nodes, it crashes right away with this error:

==================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 99283 RUNNING AT iforge127

=   EXIT CODE: 4

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

I understand the issue may be with the application, but I would like to know how to debug and resolve it. Thank you for the help.

Regards,

Intel MPI with Distributed Ansys Mechanical


Can anyone share a success story of running distributed Ansys (Mechanical) with Intel MPI on Windows 10 between two PCs?

Long story short: I can launch a distributed analysis on a single PC with Intel MPI, but I can't launch a distributed analysis between two PCs, whereas IBM MPI can.

Here is what I have done so far (and I hope I can get some guidance from you):

Hardware: two Dell workstations, same CPU, RAM, and everything else.

OS: Windows 10

Intel MPI Library: 2017 Update 3

After installing the Intel MPI library, setting up the environment variables, and caching the password on each machine, I ran the test "mpiexec -n 4 -ppn 2 -machine machines.txt test" and got the following output, which indicates that Intel MPI communicates successfully between the two PCs:

Hello world: rank 0 of 4 running on node1
Hello world: rank 1 of 4 running on node2
Hello world: rank 2 of 4 running on node1
Hello world: rank 3 of 4 running on node2

I did the same test on each PC with the command "ansys192 -np 2 -mpitest", and both PCs show "MPI Test has completed successfully!"

However, when I run the distributed test "ansys192 -machine machines.txt -mpitest", it looks like Ansys still treats it as a single-PC test, as the output below shows:

Mechanical APDL execution Command: mpiexec -np 2 -genvlist ANS_USER_PATH,ANSWAIT,ANSYS_SYSDIR,ANSYS_SYSDIR32,ANSYS192_DIR,ANSYSLI_RESERVE_ID,ANSYSLI_USAGE,AWP_LOCALE192,AWP_ROOT192,CADOE_DOCDIR192,CADOE_LIBDIR192,LSTC_LICENSE,P_SCHEMA,PATH,I_MPI_COLL_INTRANODE,I_MPI_AUTH_METHOD  -localroot "C:\Program Files\ANSYS Inc\v192\ANSYS\bin\winx64\MPITESTINTELMPI.EXE"  -machine machines.txt -mpitest

I appreciate all your feedback, Thank you! 

How should I edit the machines.LINUX file for my cluster?


Hello everybody:

I am a new cluster user. Recently I updated to Intel Composer XE 2013 to compile Fortran, and the Readme.txt says I need a machines.LINUX file to make sure I can use every node to run Fortran programs.

How should I edit the machines.LINUX file correctly? I have found some examples, e.g.:

BASH: cluster_prereq_is_remote_dir_mounted(): compute-11-37 <- /opt/intel -> compute-12-26
BASH: cluster_prereq_is_remote_dir_mounted(): compute-11-37 <- /opt/intel -> compute-12-27
BASH: cluster_prereq_is_remote_dir_mounted(): compute-11-37 <- /opt/intel -> compute-12-28
...

or

clusternode01

clusternode02

clusternode03

...

 

Which format is correct? I am very confused about this; please help me. Thanks so much!

MPI Crashing


Hello,

I recently upgraded my OS to Ubuntu 18.04 and have had problems since.

I have now reformatted my desktop, installed a fresh copy of Ubuntu 18.04, and installed the Intel C++ compiler and MPI Library 2019 version 2.

When I run my codes, after a couple of hours and thousands of time steps I get the following error message:

 

Abort(873060101) on node 15 (rank 15 in comm 0): Fatal error in PMPI_Recv: Invalid communicator, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x4b46a00, count=36912, MPI_DOUBLE, src=14, tag=25, MPI_COMM_WORLD, status=0x1) failed
PMPI_Recv(103): Invalid communicator
[cli_15]: readline failed

 

My code used to run fine on Ubuntu 16.04 (with an older version of Intel's compiler and MPI), and it also runs well on various big clusters.

My code uses Isend for sending information and Recv for receiving. Throughout my code I only use the MPI_COMM_WORLD communicator and never create a new one.
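
The exchange pattern is essentially the following (a stripped-down sketch, not my actual code):

#include "mpi.h"

/* Sketch of the exchange: non-blocking send, blocking receive,
   everything on MPI_COMM_WORLD, no derived communicators. */
void exchange(double *sendbuf, double *recvbuf, int count,
              int dest, int src, int tag) {
   MPI_Request req;

   MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
   MPI_Recv(recvbuf, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD,
            MPI_STATUS_IGNORE);
   MPI_Wait(&req, MPI_STATUS_IGNORE);
}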

Can you please help me find out what's wrong?

 

Thank you,

 

Elad

 
