Strange behaviour when changing X_all_q parallelization

ncolonna

Strange behaviour when changing X_all_q parallelization

Post by ncolonna » Tue Aug 28, 2018 11:47 am

Dear Yambo community,

I'm new to Yambo. I've downloaded and successfully installed yambo-4.2.3, linking it against
intel2018, netcdf/4.6.1, hdf5/1.10.1 and netcdf-fortran/4.4.4. See the setup and report in the tar.gz file
available from the link at the end of the post (I was not able to attach the file directly: "Sorry, the board attachment quota has been reached").

I was running a G0W0 calculation for silicon and I noticed a very strange behaviour when changing the parallelization
strategy for X. I ran different calculations, changing the number of MPI processes assigned to the parallelization over
the Brillouin zone and over the conduction bands:
X_all_q_CPU= "1 $nk $nc 1" # [PARALLEL] CPUs for each role
X_all_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)
These are pure MPI runs with 28 processes split over k-points and conduction bands (the node I'm running on has two
Intel Broadwell processors with 14 cores each, i.e. 28 cores per machine); a minimal launch sketch follows the list:
nk=1 nc=28
nk=2 nc=14
nk=4 nc=7
nk=7 nc=4
nk=14 nc=2
nk=28 nc=1
(and ncpu=1 nk=1 nc=1 as a reference)
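For reference, this is roughly how the runs were driven: one folder per run, with the same SAVE database linked into each. The folder names, the sed edit, the SAVE symlink and the srun invocation below are illustrative assumptions only; the actual submission script is in the linked tarball.

# Sketch: launch the six nk/nc combinations listed above in separate folders.
for pair in "1 28" "2 14" "4 7" "7 4" "14 2" "28 1"; do
  set -- $pair; nk=$1; nc=$2
  dir=run_nk${nk}_nc${nc}
  mkdir -p $dir && cp yambo.in $dir/ && ln -sfn ../SAVE $dir/SAVE
  # set the X parallelization for this run
  sed -i "s/^X_all_q_CPU=.*/X_all_q_CPU= \"1 $nk $nc 1\"/" $dir/yambo.in
  ( cd $dir && srun -n 28 yambo -F yambo.in -J $dir )   # or: mpirun -np 28 yambo ...
done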

Some of these runs end successfully (nk=1, 2, 4).
The nk=7 and nk=14 runs get stuck somewhere: the slurm job keeps running, and logging into the nodes I see that all the CPUs are at 100%
and there is no memory issue. I tried to debug with gdb (on one particular instance of yambo.x) and, from what I understood,
the code is stuck on some MPI call.
Only the nk=28 run ends with the error:
[ERROR] STOP signal received while in :[04] Dynamic Dielectric Matrix (PPA)
[ERROR] File ./run//ndb.dip_iR_and_P; Variable NOT DEFINED; NetCDF: HDF error

Even stranger is the fact that the successful runs give different results for the QP energies (see qp_report.dat).

I'm pretty sure I didn't mess up the databases (I created a different folder for each run).
I also did a serial run for reference (with a version of the code compiled without MPI and OpenMP).
Some of the successful runs give the same results as the serial one (interestingly, all the runs with
no parallelization over k...). See qp_report.dat.

It would be great if you could have a look and tell me what you think.

Thank you,

Nicola S. Colonna
Post-doctoral Research Scientist
THEOS STI IMX EPFL
ME D2 1426
Station 9
CH-1015 Lausanne (Switzerland)

Link to the tar file (contains the outputs of all the runs, the submission script and the config files):
https://drive.google.com/open?id=1nB4Er ... guVrxgmR-d

Daniele Varsano

Re: Strange behaviour when changing X_all_q parallelization

Post by Daniele Varsano » Tue Aug 28, 2018 12:02 pm

Dear Nicola,
thank you very much for reporting, we will have a careful look.
In order to reproduce the errors and problems, if needed, could you also post your QE input files and pseudopotentials?

Thanks a lot,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

ncolonna

Re: Strange behaviour when changing X_all_q parallelization

Post by ncolonna » Tue Aug 28, 2018 3:29 pm

Dear Daniele,

thanks for the very fast reply!
Here are the PWSCF inputs and pseudopotentials:
https://drive.google.com/open?id=1ZyRKH ... TWPOh33gGm

Best,

Nicola
