
MPI + OpenMP in yambo

Posted: Sun May 31, 2020 10:51 am
by waynebeibei
Dear all,

I am trying to run a parallel GW calculation with Yambo 4.5.1. Since my system is large, I adopted MPI + OpenMP with 256 cores (= 4 nodes × 64 cores), but the job always fails.
Each node has 256 GB memory.

I used the following parallelization settings:

'X_Threads= 16 # [OPENMP/X] Number of threads for response functions
DIP_Threads= 16 # [OPENMP/X] Number of threads for dipoles
SE_Threads= 16 # [OPENMP/GW] Number of threads for self-energy
X_CPU= "1 1 2 4 2" # [PARALLEL] CPUs for each role
X_ROLEs= "q g k c v" # [PARALLEL] CPUs roles (q,g,k,c,v)
X_nCPU_LinAlg_INV= 32 # [PARALLEL] CPUs for Linear Algebra
DIP_CPU= "2 4 2" # [PARALLEL] CPUs for each role
DIP_ROLEs= "k c v" # [PARALLEL] CPUs roles (k,c,v)
SE_CPU= "1 2 8" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)
'

The LOG file shows the job stopped with
' <04s> P1: [01] CPU structure, Files & I/O Directories
<04s> P1-n1303.para.bscc: CPU-Threads:256(CPU)-1(threads)-16(threads@X)-16(threads@DIP)-16(threads@SE)
<04s> P1-n1303.para.bscc: CPU-Threads:DIP(environment)-2 4 2(CPUs)-k c v(ROLEs)
<04s> P1-n1303.para.bscc: CPU-Threads:X(environment)-1 1 2 4 2(CPUs)-q g k c v(ROLEs)
<04s> P1-n1303.para.bscc: CPU-Threads:SE(environment)-1 2 8(CPUs)-q qp b(ROLEs)
<04s> P1-n1303.para.bscc: [02] CORE Variables Setup
<04s> P1-n1303.para.bscc: [02.01] Unit cells
<06s> P1-n1303.para.bscc: [02.02] Symmetries
<06s> P1-n1303.para.bscc: [02.03] RL shells
<06s> P1-n1303.para.bscc: [02.04] K-grid lattice
<06s> P1-n1303.para.bscc: Grid dimensions : 4 4 2
<06s> P1-n1303.para.bscc: [02.05] Energies [ev] & Occupations
<06s> P1-n1303.para.bscc: [WARNING][X] Metallic system
<11s> P1-n1303.para.bscc: [03] Transferred momenta grid
<11s> P1-n1303.para.bscc: [04] Dipoles
<11s> P1-n1303.para.bscc: [DIP] Checking dipoles header
<12s> P1-n1303.para.bscc: [WARNING] DIPOLES database not correct or not present
<12s> P1-n1303.para.bscc: DIPOLES parallel ENVIRONMENT is incomplete. Switching to defaults
<12s> P1-n1303.para.bscc: [PARALLEL DIPOLES for K(ibz) on 1 CPU] Loaded/Total (Percentual):20/20(100%)
<12s> P1-n1303.para.bscc: [PARALLEL DIPOLES for CON bands on 2 CPU] Loaded/Total (Percentual):182/364(50%)
<12s> P1-n1303.para.bscc: [PARALLEL DIPOLES for VAL bands on 128 CPU] Loaded/Total (Percentual):2/176(1%)
<12s> P1-n1303.para.bscc: [x,Vnl] computed using 600 projectors
<12s> P1-n1303.para.bscc: [WARNING] [x,Vnl] slows the Dipoles computation. To neglect it rename the ns.kb_pp file
<12s> P1-n1303.para.bscc: [MEMORY] Alloc kbv( 4.178512Gb) TOTAL: 4.429481Gb (traced) 10.06800Mb (memstat)
<12s> P1-n1303.para.bscc: Dipoles: P, V and iR (T): | | [000%] --(E) --(X)
<12s> P1-n1303.para.bscc: Reading kb_pp_pwscf_fragment_1
<12s> P1-n1303.para.bscc: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):500/500(100%)
'

Could this be related to a memory issue? Is there anything wrong with my parallelization settings? Could anyone help me fix it?

The related files are attached.

Many thanks.

Re: MPI + OpenMP in yambo

Posted: Sun May 31, 2020 3:40 pm
by Daniele Varsano
Dear waynebeibei,
Most probably it is a memory problem: you have about 4 GB per core available, and the code is allocating a bit more than that:

Code: Select all

 [MEMORY] Alloc kbv( 4.178512Gb) TOTAL:  4.429481Gb (traced)  10.06800Mb (memstat)
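In numbers (just to make the estimate explicit): each node has 256 GB and 64 cores, i.e. roughly 256/64 = 4 GB per core, while the kbv array alone is ~4.18 GB (about 4.43 GB total traced), so with one MPI task per core it does not fit unless that allocation is distributed.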
What you can do is try to move all the CPUs onto the bands (c,v), as this allows the memory to be distributed.
Note that in your calculation something also went wrong with the parallel setting of the dipoles:

Code: Select all

DIPOLES parallel ENVIRONMENT is incomplete. Switching to defaults
This probably happens because you are requesting 16 tasks × 4 nodes = 64 tasks, but then launching the code with -n 256.
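As a quick sanity check (a sketch only, with the role products taken from your input and the task count from your log above): the product of the entries in each *_CPU string has to match the number of MPI tasks actually launched, otherwise yambo falls back to its default parallel structure, as in the warning above.

Code: Select all

# hypothetical check of the failing run
NTASKS=256                                               # what the log reports as 256(CPU)
echo "DIP_CPU product: $((2*4*2))   vs $NTASKS tasks"    # 16 != 256 -> defaults
echo "X_CPU   product: $((1*1*2*4*2)) vs $NTASKS tasks"  # 16 != 256 -> defaults
echo "SE_CPU  product: $((1*2*8))   vs $NTASKS tasks"    # 16 != 256 -> defaults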

In order to be consistent, you can try to set:
DIP_CPU= "1 16 4"
X_CPU= "1 1 1 16 4"

running the code using:
srun -n 64 ...
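In the SLURM script this would correspond to something like the following sketch (assuming 16 tasks per node on your 4 nodes with one core per task; the yambo path is just a placeholder for your executable):

Code: Select all

#SBATCH -N 4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1
export OMP_NUM_THREADS=1
srun -n 64 /path/to/yambo   # placeholder for your yambo executable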

From the report I can see that your system is a metal; is this expected?

Best,
Daniele

Re: MPI + OpenMP in yambo

Posted: Mon Jun 01, 2020 8:46 am
by waynebeibei
Dear Dr. Daniele Varsano,

Thanks for your prompt reply. Following your suggestion, I now use:
'X_Threads= 1 # [OPENMP/X] Number of threads for response functions
DIP_Threads= 1 # [OPENMP/X] Number of threads for dipoles
SE_Threads= 1 # [OPENMP/GW] Number of threads for self-energy
X_CPU= "1 1 1 16 4" # [PARALLEL] CPUs for each role
X_ROLEs= "q g k c v" # [PARALLEL] CPUs roles (q,g,k,c,v)
X_nCPU_LinAlg_INV=1 # [PARALLEL] CPUs for Linear Algebra
DIP_CPU= "1 16 4" # [PARALLEL] CPUs for each role
DIP_ROLEs= "k c v" # [PARALLEL] CPUs roles (k,c,v)
SE_CPU= "1 16 4" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)
' in yambo.in and

'#!/bin/bash
#SBATCH -p amd_256
#SBATCH -N 4
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1

module load mpi/intel/18.0.2-thc
module load intel/18.0.2-thc
module load mpi/openmpi/3.1.3-icc18-cjj

srun -n 64 /public1/home/sc50136/software/yambo/yambo-4.5.1/bin/yambo
' in job.sh. The job indeed runs smoothly, with the output:
' <03s> P1: [01] CPU structure, Files & I/O Directories
<03s> P1-j2405.para.bscc: CPU-Threads:64(CPU)-64(threads)-1(threads@X)-1(threads@DIP)-1(threads@SE)
<03s> P1-j2405.para.bscc: CPU-Threads:DIP(environment)-1 16 4(CPUs)-k c v(ROLEs)
<03s> P1-j2405.para.bscc: CPU-Threads:X(environment)-1 1 1 16 4(CPUs)-q g k c v(ROLEs)
<03s> P1-j2405.para.bscc: CPU-Threads:SE(environment)-1 16 4(CPUs)-q qp b(ROLEs)
<03s> P1-j2405.para.bscc: [02] CORE Variables Setup
<03s> P1-j2405.para.bscc: [02.01] Unit cells
<04s> P1-j2405.para.bscc: [02.02] Symmetries
<04s> P1-j2405.para.bscc: [02.03] RL shells
<04s> P1-j2405.para.bscc: [02.04] K-grid lattice
<04s> P1-j2405.para.bscc: Grid dimensions : 4 4 2
<04s> P1-j2405.para.bscc: [02.05] Energies [ev] & Occupations
<04s> P1-j2405.para.bscc: [WARNING][X] Metallic system
<09s> P1-j2405.para.bscc: [03] Transferred momenta grid
<09s> P1-j2405.para.bscc: [04] Dipoles
<09s> P1-j2405.para.bscc: [DIP] Checking dipoles header
<09s> P1-j2405.para.bscc: [WARNING] DIPOLES database not correct or not present
<09s> P1-j2405.para.bscc: [PARALLEL DIPOLES for K(ibz) on 1 CPU] Loaded/Total (Percentual):20/20(100%)
<10s> P1-j2405.para.bscc: [PARALLEL DIPOLES for CON bands on 16 CPU] Loaded/Total (Percentual):23/364(6%)
<10s> P1-j2405.para.bscc: [PARALLEL DIPOLES for VAL bands on 4 CPU] Loaded/Total (Percentual):44/176(25%)
<10s> P1-j2405.para.bscc: [x,Vnl] computed using 600 projectors
<10s> P1-j2405.para.bscc: [WARNING] [x,Vnl] slows the Dipoles computation. To neglect it rename the ns.kb_pp file
<10s> P1-j2405.para.bscc: [MEMORY] Alloc kbv( 4.178512Gb) TOTAL: 4.429481Gb (traced) 10.42400Mb (memstat)
<10s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): | | [000%] --(E) --(X)
<10s> P1-j2405.para.bscc: Reading kb_pp_pwscf_fragment_1
<10s> P1-j2405.para.bscc: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):500/500(100%)
<10s> P1-j2405.para.bscc: Reading wf_fragments_1_1
<04m-03s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |# | [002%] 03m-53s(E) 02h-35m-22s(X)
<07m-37s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |## | [005%] 07m-27s(E) 02h-29m-01s(X)
<07m-37s> P1-j2405.para.bscc: Reading kb_pp_pwscf_fragment_2
<11m-00s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |### | [007%] 10m-50s(E) 02h-24m-33s(X)
<14m-20s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |#### | [010%] 14m-10s(E) 02h-21m-46s(X)
<14m-21s> P1-j2405.para.bscc: Reading kb_pp_pwscf_fragment_3
<18m-11s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |##### | [012%] 18m-01s(E) 02h-24m-12s(X)
<21m-33s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |###### | [015%] 21m-22s(E) 02h-22m-33s(X)
<21m-33s> P1-j2405.para.bscc: Reading kb_pp_pwscf_fragment_4
<25m-14s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |####### | [017%] 25m-04s(E) 02h-23m-14s(X)
<29m-56s> P1-j2405.para.bscc: Dipoles: P, V and iR (T): |######## | [020%] 29m-45s(E) 02h-28m-48s(X)
'.

But in my understanding, the above setting is pure MPI parallelism without OpenMP, right? How can I adopt a hybrid MPI + OpenMP approach across multiple nodes? For example, if I set yambo.in as follows
'X_Threads= 4 # [OPENMP/X] Number of threads for response functions
DIP_Threads= 4 # [OPENMP/X] Number of threads for dipoles
SE_Threads= 4 # [OPENMP/GW] Number of threads for self-energy
X_CPU= "1 1 1 4 1" # [PARALLEL] CPUs for each role
X_ROLEs= "q g k c v" # [PARALLEL] CPUs roles (q,g,k,c,v)
X_nCPU_LinAlg_INV=1 # [PARALLEL] CPUs for Linear Algebra
DIP_CPU= "1 4 1" # [PARALLEL] CPUs for each role
DIP_ROLEs= "k c v" # [PARALLEL] CPUs roles (k,c,v)
SE_CPU= "1 4 1" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)
', how should I set the values in job.sh:
'#!/bin/bash
#SBATCH -p amd_256
#SBATCH -N 4
#SBATCH --tasks-per-node=??
#SBATCH --cpus-per-task=??
module load mpi/intel/18.0.2-thc
module load intel/18.0.2-thc
module load mpi/openmpi/3.1.3-icc18-cjj
export OMP_NUM_THREADS=??
srun -n 64 /public1/home/sc50136/software/yambo/yambo-4.5.1/bin/yambo
'.

Many thanks for your help.

Re: MPI + OpenMP in yambo

Posted: Mon Jun 01, 2020 9:16 am
by Daniele Varsano
Dear waynebeibei,

in order to use the hybrid approach, just set the number of threads in the input keywords, e.g. X_Threads=nthreads etc.,
and in the script:
#SBATCH --cpus-per-task=nthreads
export OMP_NUM_THREADS=nthreads

The srun has to be launched with the -n ntasks option.

Here is an example:
'X_Threads= 4 # [OPENMP/X] Number of threads for response functions
DIP_Threads= 4 # [OPENMP/X] Number of threads for dipoles
SE_Threads= 4 # [OPENMP/GW] Number of threads for self-energy
X_CPU= "1 1 1 16 1" # [PARALLEL] CPUs for each role
X_ROLEs= "q g k c v" # [PARALLEL] CPUs roles (q,g,k,c,v)
X_nCPU_LinAlg_INV=1 # [PARALLEL] CPUs for Linear Algebra
DIP_CPU= "1 16 1" # [PARALLEL] CPUs for each role
DIP_ROLEs= "k c v" # [PARALLEL] CPUs roles (k,c,v)
SE_CPU= "1 16 1" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)'
and in job.sh:

'#!/bin/bash
#SBATCH -p amd_256
#SBATCH -N 4
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=4
module load mpi/intel/18.0.2-thc
module load intel/18.0.2-thc
module load mpi/openmpi/3.1.3-icc18-cjj
export OMP_NUM_THREADS=4
srun -n 16 /public1/home/sc50136/software/yambo/yambo-4.5.1/bin/yambo'.

This is a job for 4 nodes, with a total of 16 MPI tasks and 4 threads per task.
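Just to make the bookkeeping explicit (a sketch, using your node layout of 4 nodes × 64 cores): the product of each *_CPU string must equal the number of MPI tasks, and ntasks × OMP_NUM_THREADS should not exceed the physical cores.

Code: Select all

# hypothetical consistency check for the hybrid example above
NODES=4; TASKS_PER_NODE=4; THREADS=4
NTASKS=$((NODES*TASKS_PER_NODE))                      # 16, matching "srun -n 16"
echo "SE_CPU product : $((1*16*1)) (must equal $NTASKS)"
echo "cores used     : $((NTASKS*THREADS)) of $((NODES*64)) available"

This layout uses 64 of the 256 physical cores; you can increase tasks-per-node and/or the thread counts, as long as the role products still match the MPI task count.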

Best,
Daniele