YAMBO parallel for large system

Post by jyin002 » Wed Sep 14, 2016 8:22 am

Dear YAMBO developers,
I am trying to use the GW0 method implemented in YAMBO (v 3.4.2) to calculate the corrected electronic bands of a solid. Please check the GW input file below:
gw0 # [R GW] GoWo Quasiparticle energy levels
ppa # [R Xp] Plasmon Pole Approximation
em1d # [R Xd] Dynamical Inverse Dielectric Matrix
HF_and_locXC # [R XX] Hartree-Fock Self-energy and Vxc
EXXRLvcs= 20 Ry # [XX] Exchange RL components
Chimod= "Hartree" # [X] IP/Hartree/ALDA/LRC/BSfxc
% QpntsRXp
1 | 34 | # [Xp] Transferred momenta
% BndsRnXp
1 | 280 | # [Xp] Polarization function bands
NGsBlkXp= 1 Ry # [Xp] Response block size
% LongDrXp
1.000000 | 0.000000 | 0.000000 | # [Xp] [cc] Electric Field
PPAPntXp= 27.21138 eV # [Xp] PPA imaginary energy
% GbndRnge
1 | 280 | # [GW] G[W] bands range
GDamping= 0.100000 eV # [GW] G[W] damping
dScStep= 0.100000 eV # [GW] Energy step to evalute Z factors
DysSolver= "n" # [GW] Dyson Equation solver (`n`,`s`,`g`)
%QPkrange # [GW] QP generalized Kpoint/Band indices
1| 1| 210| 222|
%QPerange # [GW] QP generalized Kpoint/Energy indices
1| 34| 0.0|-1.0|
The job runs correctly only if I use a small number of cores (e.g. 8 or fewer) on the HPC cluster:
srun --ntasks=8 --hint=nomultithread --ntasks-per-node=8 --ntasks-per-socket=4 --ntasks-per-core=1 --mem_bind=v,local ${YAMBO_HOME}/bin/yambo -F INPUTS/06_BSE -J 06_BSE

But if I increase the number of cores to 32 (one node) or more, it always stops like this:
<---> [01] Files & I/O Directories
<---> [02] CORE Variables Setup
<---> [02.01] Unit cells
<01s> [02.02] Symmetries
<01s> [02.03] RL shells
<01s> [02.04] K-grid lattice
<01s> [02.05] Energies [ev] & Occupations
<01s> [03] Transferred momenta grid
<01s> [04] Bare local and non-local Exchange-Correlation
<01s> [Distribute] Average allocated memory is [o/o]: 7.502401
<01s> [M 0.773 Gb] Alloc WF ( 0.721)
<02s> [FFT-HF/Rho] Mesh size: 30 30 95
<02s> [WF-HF/Rho loader] Wfs (re)loading | | [000%] --(E) --(X)
<02s> [M 0.996 Gb] Alloc wf_disk ( 0.222)
<08s> [WF-HF/Rho loader] Wfs (re)loading |# | [009%] 05s(E) 58s(X)
<14s> [WF-HF/Rho loader] Wfs (re)loading |#### | [020%] 11s(E) 56s(X)
<19s> [WF-HF/Rho loader] Wfs (re)loading |###### | [032%] 17s(E) 53s(X)
<25s> [WF-HF/Rho loader] Wfs (re)loading |######## | [043%] 23s(E) 52s(X)
<31s> [WF-HF/Rho loader] Wfs (re)loading |########### | [055%] 28s(E) 51s(X)
<37s> [WF-HF/Rho loader] Wfs (re)loading |############# | [067%] 34s(E) 50s(X)
<42s> [WF-HF/Rho loader] Wfs (re)loading |############### | [079%] 40s(E) 50s(X)
<48s> [WF-HF/Rho loader] Wfs (re)loading |################## | [091%] 46s(E) 50s(X)
<51s> [WF-HF/Rho loader] Wfs (re)loading |####################| [100%] 49s(E) 49s(X)
<51s> [M 0.775 Gb] Free wf_disk ( 0.222)
<51s> EXS | | [000%] --(E) --(X)
<56s> P001: EXS |### | [016%] 05s(E) 29s(X)
<01m-01s> P001: EXS |###### | [033%] 10s(E) 29s(X)
<01m-06s> P001: EXS |########## | [050%] 15s(E) 29s(X)
<01m-11s> P001: EXS |############# | [067%] 20s(E) 29s(X)
<01m-16s> P001: EXS |################ | [084%] 25s(E) 29s(X)
<01m-20s> P001: EXS |####################| [100%] 28s(E) 28s(X)
<01m-20s> [xc] Functional Perdew, Burke & Ernzerhof(X)+Perdew, Burke & Ernzerhof(C)
<01m-20s> [xc] LIBXC used to calculate xc functional
<01m-20s> [M 0.052 Gb] Free WF ( 0.721)
<01m-21s> [05] Dynamic Dielectric Matrix (PPA)
<01m-21s> [Distribute] Average allocated memory is [o/o]: 77.85714
It seems that increasing the number of cores does not speed up the calculation. I would like to know how to handle a large system with more than one thousand cores and run the jobs in parallel more efficiently.
Jun Yin
Nanyang Technological University

Re: YAMBO parallel for large system

Post by Daniele Varsano » Wed Sep 14, 2016 8:35 am

Dear Jun Yin,

to use a large number of cores efficiently you need to update to the 4.x release, where the parallelism has been totally revised.
In 4.x you have flexibility in the parallelization strategy: you can have a look at a simple tutorial here:

To expose the variables governing the parallelism, add "-V par" to the command line used to generate the input file.


Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale

Re: YAMBO parallel for large system

Post by jyin002 » Wed Sep 14, 2016 9:32 am

Dear Daniele,

Thanks a lot for your suggestions. I just tried the same calculation (2 nodes, 64 cores) with YAMBO 4.0.2, adding these extra lines to the GW input file:
X_all_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)
X_all_q_CPU= "2 4 4 2" # Parallelism over q points only
SE_ROLEs= "q qp b"
SE_CPU= "2 8 2" # Parallilism over q points only
It stopped abnormally, and one of the files in the LOG folder shows:
<01s> P0001: [01] CPU structure, Files & I/O Directories
<01s> P0001: CPU-Threads:64(CPU)-1(threads)-1(threads@X)-1(threads@DIP)-1(threads@SE)-1(threads@RT)-1(threads@K)
<01s> P0001: CPU-Threads:X_all_q(environment)-2 4 4 2(CPUs)-q k c v(ROLEs)
<01s> P0001: CPU-Threads:SE(environment)-2 8 2(CPUs)-q qp b(ROLEs)
<01s> P0001: [02] CORE Variables Setup
<01s> P0001: [02.01] Unit cells
<02s> P0001: [02.02] Symmetries
<02s> P0001: [02.03] RL shells
<02s> P0001: [02.04] K-grid lattice
<02s> P0001: [02.05] Energies [ev] & Occupations
<02s> P0001: [03] Transferred momenta grid
<02s> P0001: [M 0.052 Gb] Alloc bare_qpg ( 0.020)
<02s> P0001: [04] External corrections
<03s> P0001: [05] Dynamic Dielectric Matrix (PPA)
<03s> P0001: [PARALLEL Response_G_space for K(bz) on 4 CPU] Loaded/Total (Percentual):16/64(25%)
<03s> P0001: [PARALLEL Response_G_space for Q(ibz) on 2 CPU] Loaded/Total (Percentual):17/34(50%)
<03s> P0001: [PARALLEL Response_G_space for K(ibz) on 1 CPU] Loaded/Total (Percentual):34/34(100%)
<03s> P0001: [PARALLEL Response_G_space for CON bands on 4 CPU] Loaded/Total (Percentual):16/64(25%)
<03s> P0001: [PARALLEL Response_G_space for VAL bands on 2 CPU] Loaded/Total (Percentual):108/216(50%)
<03s> P0001: Matrix Inversion uses 1 CPUs
<03s> P0001: Matrix Diagonalization uses 1 CPUs
<03s> P0001: [DIP] Checking dipoles header
<03s> P0001: [x,Vnl] computed using 1732 projectors
<03s> P0001: [WARNING] [x,Vnl] slows the Dipoles computation. To neglect it rename the ns.kb_pp file
<03s> P0001: [M 4.194 Gb] Alloc KBV ( 4.129)
<03s> P0001: [M 6.707 Gb] Alloc WF ( 2.513)
<03s> P0001: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):4216/9520(44%)
<05s> P0001: [WF] Performing Wave-Functions I/O from ./SAVE
<05s> P0001: [M 6.782 Gb] Alloc wf_disk ( 0.074)
<05s> P0001: Reading wf_fragments_1_1
<05s> P0001: Reading wf_fragments_1_2
<06s> P0001: Reading wf_fragments_1_3
<06s> P0001: Reading wf_fragments_2_1
<06s> P0001: Reading wf_fragments_2_2
<07s> P0001: Reading wf_fragments_2_3
<07s> P0001: Reading wf_fragments_3_1
<08s> P0001: Reading wf_fragments_3_2
<08s> P0001: Reading wf_fragments_3_3
Could you please help me figure out the problem? Thank you again.
Jun Yin
Nanyang Technological University

Re: YAMBO parallel for large system

Post by Daniele Varsano » Wed Sep 14, 2016 9:48 am

Dear Jun,

in order to spot the problem, the complete input/report and the error message would help.
Anyway, I can see from your post that:
1) X_all_q_CPU and SE_CPU are inconsistent: the first multiplies to 64 CPUs (2x4x4x2), the second to 32 CPUs (2x8x2).
2) You are allocating 6.782 Gb: check that you have that much RAM per core.
3) As a tip, avoid parallelization on q; it usually leaves the calculation unbalanced. So set the q role to 1.
4) To distribute the memory, try to parallelize on bands (c,v).
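
Point 1 can be checked before submitting the job. The following is only a minimal sketch in Python, not part of yambo: the helper names `cpu_product` and `check_layout` are made up for illustration, and it assumes that the product of the numbers in each *_CPU string must equal the number of MPI tasks the job is launched with.

```python
from math import prod


def cpu_product(cpu_string):
    """Multiply the integers in a yambo *_CPU string such as "2 4 4 2"."""
    return prod(int(field) for field in cpu_string.split())


def check_layout(n_mpi_tasks, roles, cpus):
    """Return True if roles and cpus have the same number of fields and
    the CPU counts multiply to the number of MPI tasks (assumed rule)."""
    if len(roles.split()) != len(cpus.split()):
        return False
    return cpu_product(cpus) == n_mpi_tasks


# Values taken from the input posted above, run on 64 cores.
x_ok = check_layout(64, "q k c v", "2 4 4 2")   # 2*4*4*2 = 64
se_ok = check_layout(64, "q qp b", "2 8 2")     # 2*8*2  = 32
print("X_all_q consistent:", x_ok)
print("SE consistent:", se_ok)
```

With the posted values the response-function layout matches the 64 tasks while the self-energy layout does not, which is the inconsistency noted in point 1.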


Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale

Re: YAMBO parallel for large system

Post by Harshita » Thu Sep 12, 2024 7:35 am

Dear Daniele,

I am having a similar problem with the HF step (including anisotropy) in YAMBO for a system with 75 atoms. I have tried increasing the number of nodes and reducing the cores per node to make more memory available, up to 8 nodes with 256 GB of RAM each, but it does not work. Then I tried the following parallelization:

X_all_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)
X_all_q_CPU= "1 4 4 2" # Parallelism over q points only
SE_ROLEs= "q qp b"
SE_CPU= "1 8 4" # Parallilism over q points only

It still does not work.
The run always stops at the "[06] Optics" step: it computes the dipoles but stops at the optical calculation.

Is there any other way to handle large systems?
There are 498 filled bands. I have already reduced the total number of bands in the nscf calculation to 1024, set conv_thr to '1d-03' and the k-mesh to '9 9 1'; reducing further would not give correct results anyway.

Can you suggest something? Any help would be highly appreciated.

Thanks and regards
Harshita, Research Scholar, INST

Re: YAMBO parallel for large system

Post by Daniele Varsano » Mon Sep 16, 2024 7:53 am

Dear Harshita,

to optimise the memory distribution, you need to assign all the MPI tasks to "c" and "v" in X_all_q_CPU and to "b" in SE_CPU.
Anyway, I do not understand why you are talking about "Optics" in an HF calculation.
If the problem persists, please attach your input/report/log files.
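
As an illustration of the band-only layout described above, here is a small Python sketch that builds the two CPU strings for a given number of MPI tasks. The helper names `split_two` and `band_only_layout` are hypothetical, and the near-square split between "c" and "v" is only an assumption made for the example. Whether a particular split is accepted or efficient depends on the number of bands and on the yambo version, which this sketch does not check.

```python
from math import isqrt


def split_two(n):
    """Return (a, b) with a * b == n and a >= b, as close to square as possible."""
    for b in range(isqrt(n), 0, -1):
        if n % b == 0:
            return n // b, b


def band_only_layout(n_mpi_tasks):
    """Put every MPI task on band roles: c and v for the response
    function, b for the self-energy; all other roles are set to 1."""
    c, v = split_two(n_mpi_tasks)  # assumption: larger factor goes to "c"
    return {
        "X_all_q_ROLEs": "q k c v",
        "X_all_q_CPU": f"1 1 {c} {v}",
        "SE_ROLEs": "q qp b",
        "SE_CPU": f"1 1 {n_mpi_tasks}",
    }


# Print the lines in yambo input syntax for a 32-task job.
for name, value in band_only_layout(32).items():
    print(f'{name}= "{value}"')
```

For 32 tasks this gives X_all_q_CPU= "1 1 8 4" and SE_CPU= "1 1 32"; both products equal the number of MPI tasks.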

Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale

Re: YAMBO parallel for large system

Post by Harshita » Thu Sep 19, 2024 1:00 pm

Dear Daniele,

It still isn't working. I have attached here my input and log files for your kind reference.

Regarding this:
Daniele Varsano wrote (Mon Sep 16, 2024 7:53 am): Anyway, I do not understand why you are talking of "Optics" in an HF calculation.
I am calculating the optical properties at the HF level to take the material anisotropy into account.

Thanks and regards
Looking forward to your reply,
Harshita, Research Scholar, INST

Re: YAMBO parallel for large system

Post by Daniele Varsano » Sat Sep 21, 2024 8:36 am

Dear Harshita,

your input file is not correct. You are mixing self-energy calculations, i.e. quasiparticle properties, with optics.
My suggestion is to generate an input file using the yambo command line and to follow one of the tutorials on the yambo wiki page.

If you want to calculate optical properties at the TD-HF level, you can generate the input file using:

yambo -o b -k hf -y h
which means optics in transition space with an HF kernel and iterative diagonalization (you can add -V par to expose the parallelization strategy variables).
You can use yambo -h to get help on generating input files.

Next, if you choose a parallelization strategy, check your submission script carefully: in the example you attached, the input file asks for 64 MPI tasks but the job was run with 32 CPUs. In this case yambo disregards the parallel strategy you have chosen, as reported in the log files: "parallel ENVIRONMENT is incomplete. Switching to defaults".

Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale

Re: YAMBO parallel for large system

Post by Harshita » Mon Sep 23, 2024 12:13 pm

Dear Daniele,

Thanks for your quick suggestions. I am still facing the same problem. Could you please tell me what I am missing?
Kindly find the input and log files attached herewith.

Harshita, Research Scholar, INST

Re: YAMBO parallel for large system

Post by Daniele Varsano » Wed Sep 25, 2024 8:29 am

Dear Harshita,

with your input, you are trying to build an extremely large matrix. The dimension of the HF matrix is (nc*nv*nk).
The extremely large size is due to the number of bands you included:

% BSEBands
    1 | 1024 |                       # [BSK] Bands range
Not all of these bands are needed, in particular the deep occupied states, as they do not contribute to the low-energy spectrum.
The recipe is to include a few bands across the gap and then enlarge the band range until convergence.
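
To see how quickly the (nc*nv*nk) dimension grows, here is a rough Python estimate. It is only a sketch under stated assumptions: the function name `bse_matrix_estimate` is made up, nk is taken as the 81 points of the full 9x9x1 mesh, the 498 filled bands out of 1024 are taken as nv = 498 and nc = 526, and the full square matrix is stored with 8 bytes per element (single-precision complex) with no use of symmetry. The actual memory use in yambo will differ.

```python
def bse_matrix_estimate(n_valence, n_conduction, n_kpoints, bytes_per_element=8):
    """Return (dimension, bytes) for a full nc*nv*nk square matrix.

    bytes_per_element=8 assumes single-precision complex numbers.
    """
    dimension = n_valence * n_conduction * n_kpoints
    full_matrix_bytes = dimension * dimension * bytes_per_element
    return dimension, full_matrix_bytes


# All 1024 bands: 498 occupied, 526 empty, 81 k-points (assumed full BZ).
dim_all, bytes_all = bse_matrix_estimate(498, 526, 81)
# Hypothetical small window: 5 valence and 5 conduction bands around the gap.
dim_few, bytes_few = bse_matrix_estimate(5, 5, 81)

print(f"all bands : dimension {dim_all}, about {bytes_all / 1e12:.0f} TB")
print(f"5+5 bands : dimension {dim_few}, about {bytes_few / 1e6:.0f} MB")
```

Under these assumptions the full band range gives a dimension of about 21 million, far beyond any available memory, while a window of a few bands across the gap gives a dimension of 2025.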

Please note that you are specifying a parallel strategy for the response function and the self-energy, i.e. what is needed for a GW calculation, but this is an optics calculation and the variables governing the parallelization for this runlevel are different:

BS_CPU= ""             # [PARALLEL] CPUs for each role
BS_ROLEs= "k eh t"

They appear in your input when you add "-V par" to the input generation command line.

Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale

