Channel: Clusters and HPC Technology

MPI_Allreduce is toooo slow


I used the Intel MPI Benchmarks (IMB) to measure the performance of MPI_Allreduce over a rack of 25 machines connected by a 40 Gb/s InfiniBand switch. (I am using the latest version of Parallel Studio 2017 on CentOS 7 with Linux kernel 3.10.)

mpiexec.hydra -genvall -n 25 -machinefile ./machines ~/bin/IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000,128

#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part
#------------------------------------------------------------
# Date                  : Mon Feb 20 16:40:26 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.el7.x86_64
# Version               : #1 SMP Thu Nov 19 22:10:57 UTC 2015
# MPI Version           : 3.0

...

# /home/syko/Turbograph-DIST/linux_ver/bin//IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000
#

# Minimum message length in bytes:   0
# Maximum message length in bytes:   536870912
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Allreduce

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 25
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.12         0.18         0.15
     67108864            2    298859.48    340774.54    329296.10
    134217728            1    619451.05    727700.95    687140.46
    268435456            1   1104426.86   1215415.00   1177512.81
    536870912            1   2217355.97   2396162.03   2331228.14

# All processes entering MPI_Finalize

So I conclude that the throughput of MPI_Allreduce is about (# of bytes / sizeof(float)) / elapsed time = (536870912 / 4) / 2.33 s ≈ 57 million elements/sec.

This throughput is far below what I expected, and the network bandwidth usage is also much lower than InfiniBand's maximum.
Is this performance acceptable for Intel MPI? If not, is there something I can do to improve it?
(I tried varying I_MPI_ADJUST_ALLREDUCE, but was not satisfied with the results; a sweep sketch follows.)
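
As a point of reference, here is a minimal sketch (not from the original post) of how one might sweep the algorithm IDs that I_MPI_ADJUST_ALLREDUCE accepts and compare timings; the set of valid IDs depends on the Intel MPI version, so check the reference manual for your release.

# Sweep Allreduce algorithm selections and rerun the benchmark for each.
for alg in 1 2 3 4 5 6 7 8; do
    echo "=== I_MPI_ADJUST_ALLREDUCE=$alg ==="
    mpiexec.hydra -genvall -genv I_MPI_ADJUST_ALLREDUCE $alg \
        -n 25 -machinefile ./machines \
        ~/bin/IMB-MPI1 Allreduce -npmin 25 -msglog 26:29 -iter 1000,128
done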


Trouble installing Windows MPI runtime environment


Hi,

I'm trying to install the Windows MPI Runtime Environment on a Windows 7 / x64 machine for use with some modelling software. I'm using version 5.1 Update 1 (or 5.1.1.110, which is not quite the latest, but is what the vendor of the modelling software says is required). The machine previously had a 4.x version, but this has been uninstalled.

When I run the installer it self-extracts, then I see an Intel splash screen for a couple of seconds, and then it vanishes and nothing further happens. There does not appear to be anything applicable in the Windows event logs when this happens.

The PC in question is a university-managed one. I have sufficient permissions to use "Run as Administrator", but this does not help. I guess it's possible that a group policy or other restriction is causing trouble; if so, diagnosing this would be helpful so that I could take the evidence to IT.

Any help or advice would be appreciated!

Thanks

Simon.

Thread Topic: Help Me

How to install Parallel Studio Cluster Edition to a specific location from the command line


We ordered and installed Parallel Studio Cluster Edition 2017 for our new cluster. However, our cluster management software, Bright, requires the compilers to be installed in a shared directory instead of the default one (/opt/intel); otherwise the compute nodes cannot find the mpirun program. How can I change the target destination (e.g., /cm/shared/apps) from the command line? Any further help and suggestions will be highly appreciated.
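
For reference, a hedged sketch of an unattended install into a shared prefix; the option and key names below are assumptions based on the 2017-era installer, so generate an exact template for your version first with the installer's --duplicate option.

# Generate a silent-install template for your installer version, then edit it:
./install.sh --duplicate=silent.cfg
# In silent.cfg, keys along these lines select the target directory:
#   ACCEPT_EULA=accept
#   PSET_INSTALL_DIR=/cm/shared/apps/intel
# Then run the installer non-interactively:
./install.sh --silent silent.cfg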

Best,
Leon

Thread Topic: Help Me

mpirun seems to set GOMP_CPU_AFFINITY


It appears Intel MPI is setting GOMP_CPU_AFFINITY. Why, and how do I prevent it?

When I print my env I get:

bash-4.2$ env | grep OMP
OMP_PROC_BIND=true
OMP_PLACES=threads
OMP_NUM_THREADS=2

But when I run env under mpirun, I see that GOMP_CPU_AFFINITY has been set for me. Why?

bash-4.2$
bash-4.2$ mpirun -n 1 env | grep OMP
OMP_PROC_BIND=true
OMP_NUM_THREADS=2
OMP_PLACES=threads
GOMP_CPU_AFFINITY=0,1

The reason this is a problem is that I'm using OMP environment variables to control affinity. Observe:

bash-4.2$ env | grep I_MPI
I_MPI_PIN_DOMAIN=2:compact
I_MPI_FABRICS=shm:tmi
I_MPI_RESPECT_PROCESS_PLACEMENT=0
I_MPI_CC=icc
I_MPI_DEBUG=4
I_MPI_PIN_ORDER=bunch
I_MPI_PIN_RESPECT_CPUSET=off
I_MPI_ROOT=/opt/intel-mpi/2017

Why this is a problem: I get a bunch of bizarre warnings about GOMP_CPU_AFFINITY, and about invalid OS proc IDs for the processors listed in GOMP_CPU_AFFINITY, like this:

bash-4.2$ mpirun -n 1 ./hello_mpi
OMP: Warning #181: OMP_PROC_BIND: ignored because GOMP_CPU_AFFINITY has been defined
OMP: Warning #181: OMP_PLACES: ignored because GOMP_CPU_AFFINITY has been defined
OMP: Warning #123: Ignoring invalid OS proc ID 1.

 hello from master thread
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm and tmi data transfer modes
[0] MPI startup(): Rank    Pid      Node name           Pin cpu
[0] MPI startup(): 0       81689    kit002.localdomain  {0,36}
hello_parallel.f: Number of tasks=  1 My rank=  0 My name=kit002.localdomain

I have a hybrid MPI/OpenMP code compiled with Intel 2017 and run with Intel MPI 2017, on a Linux cluster under SLURM.  The code has a simple OMP master region which prints hello from the master thread, then exits the parallel region and prints the number of ranks, which rank this is, and the host name for the node.  Simple stuff:

program hello_parallel

  ! Include the MPI library definitions:
  include 'mpif.h'

  integer numtasks, rank, ierr, rc, len, i
  character*(MPI_MAX_PROCESSOR_NAME) name

  !$omp master
   print*, "hello from master thread"
  !$omp end master

  ! Initialize the MPI library:
  call MPI_INIT(ierr)
  if (ierr .ne. MPI_SUCCESS) then
     print *,'Error starting MPI program. Terminating.'
     call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  ! Get the number of processors this job is using:
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

  ! Get the rank of the processor this thread is running on.  (Each
  ! processor has a unique rank.)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  ! Get the name of this processor (usually the hostname)
  call MPI_GET_PROCESSOR_NAME(name, len, ierr)
  if (ierr .ne. MPI_SUCCESS) then
     print *,'Error getting processor name. Terminating.'
     call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
  end if

  print "('hello_parallel.f: Number of tasks=',I3,' My rank=',I3,' My name=',A,'')",&
       numtasks, rank, trim(name)

  ! Tell the MPI library to release all resources it is using:
  call MPI_FINALIZE(ierr)

end program hello_parallel

Compiled simply:   mpiifort -g -qopenmp -o hello_mpi hello_mpi.f90
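
A hedged note rather than a confirmed answer: Intel MPI appears to derive GOMP_CPU_AFFINITY from its own pinning settings (I_MPI_PIN_DOMAIN and friends), so two things worth trying are sketched below; the exact behavior may differ across Intel MPI versions.

# 1) Turn off Intel MPI process pinning entirely and let the OMP_* variables
#    control thread placement:
mpirun -genv I_MPI_PIN off -n 1 ./hello_mpi

# 2) Keep pinning but override the exported variable with an empty value
#    (this may still trigger a libgomp warning on some versions):
mpirun -genv GOMP_CPU_AFFINITY "" -n 1 ./hello_mpi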


MPI Installer error


I have tried to run several MPI installers (w_mpi_p_2017.2.187.exe, w_mpi-rt_p5.1.3.180, and w_mpi-rt_p_4.1.3.047) and all of them failed on my Windows 7 PC. The following error shows up:

Error while extracting file:

C:\...\autorun.inf

A file error occurred (The system cannot find the file specified.).

What can I do to solve this issue?

Trading App


I'm building a trading app that trades up to 7 assets concurrently (independently). The trading server is accessed through WebSockets, so each concurrent task must open and close a WebSocket connection every minute (1-minute trading).

Is Intel MPI a good choice for this? I have a quad-core / 4-thread computer.

Thank you.

Thread Topic: Question

A problem in installing the Intel MPI Library w_mpi_p_2017.2.187.exe under Windows 10


A problem arises when I try to install the latest version of the Intel MPI Library, w_mpi_p_2017.2.187.exe, under Windows 10, on PCs based on both an Intel i7-6950X and an Intel Core 2 Duo E6750. Two screenshots are available. My license is still valid.

The older versions of the Intel MPI Library (w_mpi_p_5.1.3.180.exe, w_mpi_p_2017.0.109.exe, and w_mpi_p_2017.1.143.exe) installed successfully on these PCs.

Please, help me to resolve the problem.

Andrei Voloschenko

Thread Topic: Bug Report

[Intel MPI benchmarks] Non-free license of JavaScript files


Intel MPI benchmarks is licensed under the Common Public License, which is approved both by FSF and OSI as a Free Software / Open Source license. However, while packaging Intel MPI benchmarks for Fedora, we found[1] a couple of files with non-free license headers in IMB_2017.tgz:

imb/imb/doc/IMB_Users_Guide/search.js:
WARNING: You must purchase a copy of FAR HTML v4 or greater to use this file.

imb/imb/doc/IMB_Users_Guide/tree.js:
Please don't use this file without purchasing FAR. http://helpware.net/FAR/

I am not familiar with FAR. I understand it is an HTML help authoring tool. It's probably unintentional for it to impose license restrictions on output generated by it, but to be safe the HTML documentation has been removed from the tarball for Fedora.

Would you, developer(s) of IMB, please make sure all files are distributable under a Free Software / Open Source license in a future release?

Thanks,
Michal Schmidt

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1425960#c2

Thread Topic: Bug Report

mpi2017 update2 segfault at libmpifort.so.12.0 when mpirun test


Hello,

After installing parallel_studio_xe_2017.2.050, I tried to test MPI like this:

In the directory "/opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/test":

mpiicc test.c -o test

then:

mpirun -n 2 ./test

But a segfault happened:

[ 8671.743574] test[1229]: segfault at 1 ip 00007ff2fdc59f45 sp 00007fff51885050 error 4 in libmpifort.so.12.0[7ff2fdb27000+17e000]

My system info is:

# echo $PATH
/opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/intel/compilers_and_libraries_2017.2.174/linux/bin/intel64/

 

# file test
test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=2c11b6ba5907ec05d55bac421c90b4b3984b05e5, not stripped

 

# cpuinfo 
Intel(R) processor family information utility, Version 2017 Update 2 Build 20170125 (id: 16752)
Copyright (C) 2005-2017 Intel Corporation.  All rights reserved.

=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R)  E5-2660 0 
Packages(sockets) : 1
Cores             : 8
Processors(CPUs)  : 8
Cores per package : 8
Threads per core  : 1

=====  Processor identification  =====
Processor       Thread Id.      Core Id.        Package Id.
0               0               0               0   
1               0               1               0   
2               0               2               0   
3               0               3               0   
4               0               4               0   
5               0               5               0   
6               0               6               0   
7               0               7               0   
=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0,1,2,3,4,5,6,7         0,1,2,3,4,5,6,7

=====  Cache sharing  =====
Cache   Size            Processors
L1      32  KB          no sharing
L2      256 KB          no sharing
L3      20  MB          (0,1,2,3,4,5,6,7)

# uname -a
Linux 3.10.55-EMBSYS-CGELyocto-standard #7 SMP Tue Mar 15 19:54:08 CST 2016 x86_64 GNU/Linux

 

Thread Topic: Question


MPI/OpenMP segmentation error


Hello,

When I run my program on two nodes and set the number of threads to more than 1 (i.e., 2-16 threads), I encounter a segmentation error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

If the OpenMP directives are commented out (still using MPI), the error is gone.

Any comment/advice is appreciated!
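
A hedged first guess (an assumption, since SIGSEGV in hybrid Fortran codes is often a thread stack overflow): enlarge both the main stack and the per-thread OpenMP stacks before running and see whether the error disappears.

# Enlarge the main thread stack and the per-OpenMP-thread stack, then rerun.
ulimit -s unlimited
export OMP_STACKSIZE=512m
mpirun -n 2 ./my_program    # my_program is a placeholder for the real binary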


Abaqus with Omnipath


I am trying to get Abaqus running over an Omni-Path fabric.

Abaqus version 6.14-1, using Intel MPI 5.1.2.

In my abaqus_v.env file I set: mp_mpirun_options   -v -genv I_MPI_FABRICS shm:tmi

By the way, the -PSM2 argument is not accepted.

I cannot cut and paste the output here (argggh!), so I have attached a rather long output file.

I do not know where the wheels are coming off here. Pun intended, as this is the e1 car crash simulation.

I get lots of messages about PMI buffer overruns, but I am not sure that is the root of the problem.
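
A hedged sketch of one thing to try (an assumption, not a confirmed fix): with Omni-Path, Intel MPI 5.1's tmi fabric uses the psm2 provider, which can be selected explicitly instead of a -PSM2 flag. The exact abaqus_v.env syntax may vary.

# In abaqus_v.env, select the psm2 TMI provider explicitly:
mp_mpirun_options   -v -genv I_MPI_FABRICS shm:tmi -genv I_MPI_TMI_PROVIDER psm2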


Attachment: out.txt (text/plain, 3.13 MB)

Thread Topic: Help Me

Irqbalance update affinity_hint and banned cpus conflict


I have a cluster with Omni-Path nodes and CentOS 7.2.

On Sunday I upgraded the irqbalance package to irqbalance-1.0.7-6.

The system is now repeatedly logging: /usr/sbin/irqbalance: irq NN affinity_hint and banned cpus conflict

In /etc/sysconfig/irqbalance I have set IRQBALANCE_ARGS=--hintpolicy=exact

Looking at the Red Hat Bugzilla, this seems to be a known issue. Has anyone seen it, and is there a known workaround?

Nested MPI process failed with GetParentPort error


Hello,

We are trying to execute a tricky code with Intel MPI 5.0.2.044.

We have a main program executed via:

mpirun -n 1 Exec_main

This main program makes a system call to an external shell script that does:

mpirun -genv I_MPI_DEBUG 5 -n 4 ExecSub

When executing the external shell script manually, everything is fine.

When launching the script from the main executable, we get the following error:

[0] MPI startup(): Multi-threaded optimized library
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).......:
MPID_Init(1452).............: spawn process group was unable to obtain parent port name from the channel
MPIDI_CH3_GetParentPort(369):  PMI2 KVS_Get failed: -1
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).......:
MPID_Init(1452).............: spawn process group was unable to obtain parent port name from the channel
MPIDI_CH3_GetParentPort(369):  PMI2 KVS_Get failed: -1
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).......:
MPID_Init(1452).............: spawn process group was unable to obtain parent port name from the channel
MPIDI_CH3_GetParentPort(369):  PMI2 KVS_Get failed: -1
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).......:
MPID_Init(1452).............: spawn process group was unable to obtain parent port name from the channel
MPIDI_CH3_GetParentPort(369):  PMI2 KVS_Get failed: -1

Do you have any clue on how to understand and solve this issue?
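
A hedged guess at a workaround (an assumption: the inner mpirun inherits PMI/Hydra state from the outer launch and mistakes its processes for spawned children): clear those variables in the shell script before the inner run. Variable prefixes may differ by Intel MPI version.

# In the external shell script, before the inner mpirun:
unset $(env | grep -E '^(PMI_|I_MPI_HYDRA_|HYDRA_)' | cut -d= -f1)
mpirun -genv I_MPI_DEBUG 5 -n 4 ExecSub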

Best regards


Thread Topic: Bug Report

What could cause an MPI program to run slower on a multinode cluster?


Hello everyone,

I've got a question about MPI program performance. I've developed an MPI program that processes large amounts of data (about 10^9 elements), and I've noticed that the more processes I create with the mpiexec utility, the longer the program takes to run. What could cause this? When I run the program on a single compute node, it works faster than when I run it on two compute nodes. Please help.

Regards, Arthur.


Which logical cores are MPI background threads pinned to?


I program in C++, parallelizing with both MPI and OpenMP (one MPI process per machine; each MPI process uses multiple threads).

I carefully pin the OpenMP worker threads and the other threads in my application. I know that MPI spawns some background threads for its internal use, but I don't know which logical cores those threads are attached to.

How can I find and control it?
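
A hedged sketch for the inspection half of the question (the <PID> placeholder is the MPI process ID; this only reads affinities, it does not control where Intel MPI places its internal threads):

# List every thread of the process with the CPU it last ran on (PSR column):
ps -Lo pid,tid,psr,comm -p <PID>
# Show the allowed-CPU mask of each thread:
for tid in /proc/<PID>/task/*; do taskset -cp $(basename $tid); done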

Thanks

Different environment variable values on different compute nodes


I have two computers, and the value of an environment variable differs between them. For example, TMP is set to "C:\tmp" on computer 1 and to "D:\Temp_Dir" on computer 2. When I call getenv("TMP"), the returned value in every process is always "C:\tmp", the value from computer 1.

How can my code get "D:\Temp_Dir" on computer 2?
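
A hedged guess (an assumption: the launch node's environment is being propagated to all ranks, which Intel MPI does by default): launching with -genvnone tells Intel MPI not to propagate the local environment, so each rank should see its own machine's variables. Hostnames below are placeholders, and the exact -hosts syntax depends on your mpiexec version.

mpiexec -genvnone -n 2 -hosts computer1,computer2 my_app.exe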

Thanks,

Yongjun


Thread Topic: Help Me

Hybrid MPI/OpenMP showing poor performance when using all cores on a socket vs. N-1 cores per socket


Hi,

I'm running a hybrid MPI/OpenMP application on Intel® E5-2600 v3 (Haswell) series cores, and I notice a 40% drop in performance when using all N cores on a socket vs. N-1 cores per socket. This behavior is pronounced at higher core counts (>= 160 cores). The cluster is built with 2 CPUs per node. As a test case I tried a similar run on the Intel® E5-2600 series (Sandy Bridge), where I don't see this behavior and the performance is comparable.

I'm using Intel MPI 5.0. Both clusters use the same IB hardware. Profiling revealed that MPI time is what causes the performance drop. The application only performs MPI communication outside OpenMP regions. Any help will be appreciated.

Thanks,

GK

MPI bad termination error


Hi,

I have trouble running the test program linux/mpi/test/test.c included in the Intel MPI package.

My trouble only occurs on machines equipped with AMD processors.

Specifically, I've installed MPI in my home directory (which is NFS-mounted on both the host1 and host2 machines) and then compiled test.c using mpicc:

mpicc -show -o test test.c
gcc -o 'test' 'test.c' -I/home/jyli/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/include -L/home/jyli/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/release_mt -L/home/jyli/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /home/jyli/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib/release_mt -Xlinker -rpath -Xlinker /home/jyli/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/lib -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/2107.0.0/intel64/lib/release_mt -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/2017.0.0/intel64/lib -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread

I checked that the machines connect to each other fine

mpirun -ppn 1 -n 2 -hosts host1,host2 hostname
host1
host2

However, when I run the test program, I encounter the following errors:

mpirun -ppn 1 -n 2 -hosts host1,host2 ./test

[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 2  Build 20170125 (id: 16752)
[0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5582 RUNNING AT host2
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5582 RUNNING AT host2
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764

===================================================================================

I then attached gdb to the generated core file, and here's the backtrace:

gdb ./test core
(gdb) bt
#0  __GI_____strtol_l_internal (nptr=0x0, endptr=0x0, base=10, group=<optimized out>, loc=0x7faf932e2060 <_nl_global_locale>)
    at ../stdlib/strtol_l.c:298
#1  0x00007faf9368e11a in atoi (__nptr=<optimized out>) at /usr/include/stdlib.h:286
#2  i_mpi_numa_nodes_compare (a=0x0, b=0x0) at ../../src/mpid/ch3/src/mpid_init.c:62
#3  0x00007faf92f5b419 in msort_with_tmp (p=0x7fff5c532aa0, b=0xac7b30, n=2) at msort.c:83
#4  0x00007faf92f5b6cc in msort_with_tmp (n=2, b=0xac7b30, p=0x7fff5c532aa0) at msort.c:45
#5  __GI_qsort_r (b=0xac7b30, n=2, s=8, cmp=0x7faf9368e100 <i_mpi_numa_nodes_compare>, arg=<optimized out>) at msort.c:297
#6  0x00007faf936911af in MPID_nem_impi_create_numa_nodes_map () at ../../src/mpid/ch3/src/mpid_init.c:1305
#7  0x00007faf93692284 in MPID_Init (argc=0x0, argv=0x0, requested=10, provided=0x0, has_args=0x7faf932e2060 <_nl_global_locale>, has_env=0xac7db1)
    at ../../src/mpid/ch3/src/mpid_init.c:1732
#8  0x00007faf9362872b in MPIR_Init_thread (argc=0x0, argv=0x0, required=10, provided=0x0) at ../../src/mpi/init/initthread.c:717
#9  0x00007faf93615e2b in PMPI_Init (argc=0x0, argv=0x0) at ../../src/mpi/init/init.c:253
#10 0x0000000000400a6e in main ()
(gdb) 

Does anyone have a clue as to what is going on? Thank you very much in advance!

Jenny

MPI_File_write_shared problem


Hi,

There seems to be a problem with some versions of Intel MPI and file access through MPI shared file pointers: the files are not written correctly. We are using Intel MPI 2017 on a cluster with a GPFS filesystem. The Linux kernel version is 3.10.0-327.36.3.el7.x86_64.

Here is a code that reproduces the problem.

#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* for strlen */
#include <mpi.h>

int main(int argc, char *argv[])
{
  char string[256];
  char file_name[] = "output";
  int count, slength;
  int open_error;
  int rank;
  MPI_File fh;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  sprintf(string,"Rank : %d\n",rank);
  slength=strlen(string);

  open_error = MPI_File_open(MPI_COMM_WORLD, file_name, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  if(open_error!=MPI_SUCCESS)
  {
    fprintf(stderr,"Error opening file\n");
    MPI_Abort(MPI_COMM_WORLD,open_error);
  }
  MPI_File_write_shared(fh, string, slength, MPI_CHAR, &status);
  MPI_Get_count(&status,MPI_CHAR,&count);
  if(slength!=count)
  {
    fprintf(stderr,"rank %d: slength=%d , count=%d \n", rank, slength, count);
  }

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}

One example of output is:

$ mpirun -np 10 ./mpi_shared ; cat output | sort -n -k 3
Rank : 2
Rank : 8
Rank : 9

i.e., the output file is truncated (only 3 of the 10 lines are present).

Instead, with the options:

I_MPI_EXTRA_FILESYSTEM=on
I_MPI_EXTRA_FILESYSTEM_LIST=gpfs

the MPI API seems to work, at least for this use case.
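
For reference, the equivalent run with the variables passed on the command line (a sketch; exporting them in the environment should behave the same way):

mpirun -genv I_MPI_EXTRA_FILESYSTEM on -genv I_MPI_EXTRA_FILESYSTEM_LIST gpfs -np 10 ./mpi_shared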

I also tested this issue with Intel MPI 5.0.3, 5.1.1, and 5.1.3, obtaining the same results as with Intel MPI 2017.

It seems that, if the filesystem is not explicitly specified by means of the I_MPI_EXTRA_FILESYSTEM variables, the semantics of MPI_File_write_shared are not compliant with the MPI standard.

Can you confirm this?

Thanks.

Stefano


Thread Topic: Bug Report