I/O activity from yambo jobs is too high for the filesystem


Post by anhhv » Thu Jan 30, 2020 10:27 pm

Dear yambo developers and users,

I was running a GW calculation with yambo in parallel over 4232 CPUs. However, the cluster admins reported that the job creates I/O problems: the I/O activity is too heavy for the filesystem to handle. Initially, I thought the cause might be the number of LOG files produced. Using the NLogCPUs variable I reduced the number of frequently written LOG files by a factor of about 46: 4232 LOG files are still created, but only 92 of them are written to frequently. However, the problem was not fixed. The heavy I/O seems to appear at step [06.01] G0W0 (W PPA), when yambo reads the "ndb.pp_fragment_x" files. My input file is pasted below.

HF_and_locXC # [R XX] Hartree-Fock Self-energy and Vxc
ppa # [R Xp] Plasmon Pole Approximation
gw0 # [R GW] GoWo Quasiparticle energy levels
em1d # [R Xd] Dynamical Inverse Dielectric Matrix
EXXRLvcs= 90 Ry # [XX] Exchange RL components
Chimod= "HARTREE" # [X] IP/Hartree/ALDA/LRC/PF/BSfxc
GfnQPdb= "E < ./SAVE/ndb.QP"
XfnQPdb= "E < ./SAVE/ndb.QP"
% BndsRnXp
1 | 598 | # [Xp] Polarization function bands
%
NGsBlkXp= 10 Ry # [Xp] Response block size
% LongDrXp
1.000000 | 1.000000 | 1.000000 | # [Xp] [cc] Electric Field
%
PPAPntXp= 21 eV # [Xp] PPA imaginary energy
% GbndRnge
1 | 598 | # [GW] G[W] bands range
%
GDamping= 0.10000 eV # [GW] G[W] damping
dScStep= 0.10000 eV # [GW] Energy step to evaluate Z factors
DysSolver= "n" # [GW] Dyson Equation solver ("n","s","g")
%QPkrange # # [GW] QP generalized Kpoint/Band indices
1|47|37|64|
%
X_all_q_CPU= "1 2 46 46" # [PARALLEL] CPUs for each role
X_all_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)
X_all_q_nCPU_LinAlg_INV= 4232 # [PARALLEL] CPUs for Linear Algebra
X_Threads= 0 # [OPENMP/X] Number of threads for response functions
DIP_Threads= 0 # [OPENMP/X] Number of threads for dipoles
SE_CPU= "47 1 46" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)
SE_Threads= 0
NLogCPUs = 92 # [PARALLEL] Live-timing CPU`s (0 for all)


The system has 47 k-points and 46 valence bands. Here I set SE_CPU= "47 1 46", so the number of CPUs used for the self-energy calculation is about half of the total. The reason is that this step requires more memory than the preceding polarization calculation (X_all_q_CPU= "1 2 46 46"). If I set SE_CPU= "46 2 46", I run out of memory when computing the self-energy. I attached two log files and the standard output file for more details.

Please give me some advice on how to reduce the I/O activity. Thank you.
Viet-Anh Ha,
Oden Institute for Computational Engineering and Sciences,
https://www.oden.utexas.edu/
The University of Texas at Austin,
https://www.utexas.edu/
201 E 24th St, Austin, TX 78712, USA.


Re: I/O activity from yambo jobs is too high for the filesystem

Post by Daniele Varsano » Thu Jan 30, 2020 11:02 pm

Dear Viet-Anh Ha,

Here are some suggestions:
1) Log files: you can reduce them and print only very few; say, a total of 10 should be enough to monitor the run (see the sketch after this list).
2) Self-energy parallelization: I would avoid parallelizing over q and instead push on bands; you can set many more CPUs than the number of occupied bands. If the code complains (it shouldn't), it is because it first calculates the exchange self-energy (Sx), where only the occupied bands are needed. If that happens, you can calculate Sx in a first run (this should be rather fast); once it is calculated and stored in the database, you can calculate the correlation part using many CPUs in the "b" role of SE_ROLEs. In this way, the memory is distributed over the CPUs.
3) If this is not enough, you can fragment your calculation: e.g. split the content of QPkrange over multiple runs and merge the databases at the end using the ypp utility (see the sketch after this list).
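
For example, a minimal sketch of points 1) and 3) (the band split below is only an illustration, adapt it to your case, and check ypp -H for the exact QP-database merge option of your yambo version):

Code: Select all

# point 1): print only a handful of log files
NLogCPUs= 10                     # [PARALLEL] Live-timing CPUs (0 for all)

# point 3), run 1: first half of the QP band range
%QPkrange                        # [GW] QP generalized Kpoint/Band indices
 1|47|37|50|
%

# point 3), run 2 (a separate input/job): second half of the QP band range
# %QPkrange
#  1|47|51|64|
# %

# at the end, merge the resulting ndb.QP databases with the ypp
# QP-database merge runlevel (e.g. "ypp -q m" in the 4.x series)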

Hopefully, others can provide you with more advice.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/


Re: I/O activity from yambo jobs is too high for the filesystem

Post by anhhv » Fri Jan 31, 2020 9:36 pm

Hi Daniele,

Thank you for your suggestions. Points 1 and 3 are clear. For point 2, I did not completely get your idea. Do you mean the following?

First, I should try, e.g.:

SE_CPU= "1 1 598" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)

If an error is raised because nband > nband-valence = 46, I should move to option 2: "calculate Sx in a first run (this should be rather fast); once it is calculated and stored in the database, you can calculate the correlation part using many CPUs in the 'b' role of SE_ROLEs. In this way, the memory is distributed over the CPUs." Does that mean I should perform two separate runs? Could you tell me exactly how to set SE_CPU as well as SE_ROLEs?

How about parallelization over "qp"? Thank you.
Viet-Anh Ha,
Oden Institute for Computational Engineering and Sciences,
https://www.oden.utexas.edu/
The University of Texas at Austin,
https://www.utexas.edu/
201 E 24th St, Austin, TX 78712, USA.


Re: I/O activity from yambo jobs is too high for the filesystem

Post by Daniele Varsano » Sat Feb 01, 2020 9:16 am

Dear Viet-Anh Ha,
Does that mean I should perform two separate runs? Could you tell me exactly how to set SE_CPU as well as SE_ROLEs?
Yes: a first run for Sigma_x (HF_and_locXC), where you can set e.g.:

Code: Select all

SE_CPU= "1 46 23" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)
or similar. Considering that you need to calculate many qp corrections (~1300), you can also parallelize over "qp" (memory permitting), for instance as sketched below.
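
A possible setting, only as an illustration (it assumes you keep using all 4232 MPI tasks for this first run):

Code: Select all

SE_CPU= "1 92 46"    # [PARALLEL] CPUs for each role (92*46 = 4232 MPI tasks)
SE_ROLEs= "q qp b"   # [PARALLEL] CPUs roles (q,qp,b)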
Next, a second run for the Sigma_c calculation and the GW convolution, with the runlevels
HF_and_locXC,gw0,ppa,dyson,em1d ....
where you push on bands in order to reduce the memory per core (of course you cannot set this number higher than the number of bands used):

Code: Select all

SE_CPU= "1 1 598" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)
This will also reduce the I/O, as each core will load only a few wavefunctions.
How about parallelization over "qp"?
This is very efficient in terms of scaling, as the calculations for different "qp" are totally independent, so the computing time scales linearly, but there is no memory distribution. If you can reduce the number of MPI tasks per node and use more nodes, you can use it. In order to check the memory per core needed by the calculation, you can compile the code adding --enable-memory-profile to the configure; in this way, the memory allocated by the code is printed in the log files.
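
For example (just a sketch, to be completed with your usual configure options):

Code: Select all

./configure --enable-memory-profile [your usual configure options]
make yambo ypp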
You can also think about exploiting mixed MPI/OpenMP parallelism, for instance along the lines sketched below.
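
A minimal sketch (the thread count and the submission line are purely illustrative and depend on your nodes; the *_Threads variables are the ones already present in your input, where 0 lets them follow OMP_NUM_THREADS):

Code: Select all

# job script: fewer MPI tasks per node, each one running a few OpenMP threads
export OMP_NUM_THREADS=4
mpirun -np <number_of_MPI_tasks> yambo -F yambo.in

# yambo input: leave the thread variables at 0 to follow OMP_NUM_THREADS
X_Threads= 0      # [OPENMP/X] Number of threads for response functions
DIP_Threads= 0    # [OPENMP/X] Number of threads for dipoles
SE_Threads= 0     # threads for the self-energy part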
Some tips on parallel strategy in yambo can be found here:
http://www.yambo-code.org/wiki/index.ph ... n_parallel
and here:
http://www.yambo-code.org/wiki/index.ph ... strategies

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
