Reduce memory for each CPU in yambo 4.0.1 rev89

Various technical topics, such as parallelism and efficiency, netCDF problems, and the Yambo code structure itself, are posted here.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan, Nicola Spallanzani

caowd
Posts: 22
Joined: Mon Jan 20, 2014 4:13 am

Reduce memory for each CPU in yambo 4.0.1 rev89

Post by caowd » Fri Oct 09, 2015 2:44 am

Dear all,

I have been calculating the linear response of a 3D system with a very large number of k-points,
and I find it very hard to reduce the memory cost per CPU only by tuning the parallel configuration (q,k,c,v).

Could anyone please give some advice about that?
Wen-Dong Cao
Candidate for Ph.D
Department of Physics
Group of Condensed Material Theory
Tsinghua University
Beijing P.R.China
+86 010 62772784

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: Reduce memory for each CPU in yambo 4.0.1 rev89

Post by Daniele Varsano » Fri Oct 09, 2015 2:50 pm

Dear Wen-Dong Cao,

please keep in mind that the 4.0.1 release is still to be considered a development version, and we are still working to stabilize it and improve its performance.
A new release will come very soon.
Next, in general, if you have memory problems, the strategy is to assign more processors to the bands (c,v) rather than to q,k.
If you post your input, report and a typical log file here, we can try to be more specific.
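For instance, on 8 MPI tasks a band-dominated layout could look like the sketch below (the counts are only an example; they have to be adapted so that their product matches your actual number of tasks):

X_q_0_CPU= "2 2 2"           # [PARALLEL] CPUs for each role
X_q_0_ROLEs= "k c v"         # [PARALLEL] CPUs roles (k,c,v)
X_finite_q_CPU= "1 1 2 4"    # [PARALLEL] CPUs for each role
X_finite_q_ROLEs= "q k c v"  # [PARALLEL] CPUs roles (q,k,c,v)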

Best,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

caowd
Posts: 22
Joined: Mon Jan 20, 2014 4:13 am

Re: Reduce memory for each CPU in yambo 4.0.1 rev89

Post by caowd » Sun Oct 11, 2015 12:30 pm

Dear Daniele,

My typical input file yambo.in is:
--------------------------------------------
optics # [R OPT] Optics
chi # [R CHI] Dyson equation for Chi.
FFTGvecs= 3000 RL # [FFT] Plane-waves
ElecTemp = 0.00
X_q_0_CPU= "8" # [PARALLEL] CPUs for each role
X_q_0_ROLEs= "k" # [PARALLEL] CPUs roles (k,c,v)
X_q_0_nCPU_invert=0 # [PARALLEL] CPUs for matrix inversion
X_finite_q_CPU= "2,4" # [PARALLEL] CPUs for each role
X_finite_q_ROLEs= "c,v" # [PARALLEL] CPUs roles (q,k,c,v)
X_finite_q_nCPU_invert=0 # [PARALLEL] CPUs for matrix inversion
Chimod= "Hartree" # [X] IP/Hartree/ALDA/LRC/BSfxc
NGsBlkXd= 400 RL # [Xd] Response block size
% QpntsRXd
20 | 20 | # [Xd] Transferred momenta
%
% BndsRnXd
1 | 24 | # [Xd] Polarization function bands
%
% EnRngeXd
0.00000 | 0.20000 | eV # [Xd] Energy range
%
% DmRngeXd
0.0040 | 0.00400 | eV # [Xd] Damping range
%
ETStpsXd= 400 # [Xd] Total Energy steps
% LongDrXd
0.000000 | 0.000000 | 1.000000 | # [Xd] [cc] Electric Field
%
--------------------------------------------

My typical log file looks like this:
-------------------------------------------


__ __ ______ ____ _____
/\ \ /\ \\ _ \ /"\_/`\/\ _`\ /\ __`\
\ `\`\\/"/ \ \L\ \/\ \ \ \L\ \ \ \/\ \
`\ `\ /" \ \ __ \ \ \__\ \ \ _ <" \ \ \ \
`\ \ \ \ \ \/\ \ \ \_/\ \ \ \L\ \ \ \_\ \
\ \_\ \ \_\ \_\ \_\\ \_\ \____/\ \_____\
\/_/ \/_/\/_/\/_/ \/_/\/___/ \/_____/


<01s> P0001: [01] CPU structure, Files & I/O Directories
<01s> P0001: CPU-Threads:8(CPU)-1(threads)-1(threads@X)-1(threads@DIP)-1(threads@SE)-1(threads@RT)-1(threads@K)
<01s> P0001: CPU-Threads:X_q_0(environment)-4(CPUs)-k(ROLEs)
<01s> P0001: CPU-Threads:X_finite_q(environment)-2,4(CPUs)-c,v(ROLEs)
<01s> P0001: [02] CORE Variables Setup
<01s> P0001: [02.01] Unit cells
<01s> P0001: [02.02] Symmetries
<01s> P0001: [02.03] RL shells
<01s> P0001: [02.04] K-grid lattice
<10s> P0001: [02.05] Energies [ev] & Occupations
<16s> P0001: [WARNING]Metallic system
<54s> P0001: [03] Transferred momenta grid
-------------------------------------------

So from the log file, it seems that the program stops at the very beginning, when the transferred momenta q are calculated.

The report file is too large to be uploaded. Part of its content is:
----------------------------------------------------------------------------
[01] CPU structure, Files & I/O Directories
===========================================

* CPU-Threads :8(CPU)-1(threads)-1(threads@X)-1(threads@DIP)-1(threads@SE)-1(threads@RT)-1(threads@K)
* CPU-Threads :X_q_0(environment)-4(CPUs)-k(ROLEs)
* CPU-Threads :X_finite_q(environment)-2,4(CPUs)-c,v(ROLEs)
* MPI CHAINS : 3
* MPI CPU : 8
* THREADS (max): 1
* THREADS TOT(max): 8
* I/O NODES : 1
* Fragmented I/O :yes

CORE databases in .
Additional I/O in .
Communications in .
Input file is yambo.in
Report file is ./r_optics_chi
Log files in ./LOG

[RD./SAVE//ns.db1]------------------------------------------
Bands : 24
K-points : 9947
G-vectors [RL space]: 37113
Components [wavefunctions]: 4689
Symmetries [spatial]: 12
Spinor components : 2
Spin polarizations : 1
Temperature [ev]: 0.02585
Electrons : 15.99750
WF G-vectors : 5789
Max atoms/species : 6
No. of atom species : 2
Magnetic symmetries : no
- S/N 007947 --------------------------- v.04.00.01 r.0088 -
[02] CORE Variables Setup
=========================


[02.01] Unit cells
==================

~~~~~~~~~~~~~~~~~~~skipped~~~~~~~~~~~~~~~~~~~~~~~~

[02.04] K-grid lattice
======================

Compatible Grid is 3D
B1 [rlu]= 0.00000 0.00000 -0.02778
B2 = 0.00000 0.01786 0.00000
B3 =-.179E-01 -.931E-09 0.00
Grid dimensions : 36 56 56
K lattice UC volume [au]:0.1312E-5

[02.05] Energies [ev] & Occupations
===================================

Fermi Level [ev]: 1.845806
VBM / CBm [ev]: 0.00 0.00
Electronic Temp. [ev K]: 0.1000E-3 1.160
Bosonic Temp. [ev K]: 0.1000E-3 1.160
El. density [cm-3]: 0.645E+23
States summary : Full Metallic Empty
0001-0014 0015-0016 0017-0024

[WARNING]Metallic system

N of el / N of met el: 15.99750 1.99411
Average metallic occ.: 0.997054
X BZ K-points : 112896

Energy unit is electronVolt [eV]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~skipped~~~~~~~~~~~~~~~

[03] Transferred momenta grid
=============================

[RD./SAVE//ndb.kindx]---------------------------------------
Polarization last K : 9947
QP states : 1 9947
X grid is uniform :yes
X grid impose -q :no
BS scattering :no
COLL scattering :no
- S/N 007947 --------------------------- v.04.00.01 r.0088 -
----------------------------------------------------------------------------
Wen-Dong Cao
Candidate for Ph.D
Department of Physics
Group of Condensed Material Theory
Tsinghua University
Beijing P.R.China
+86 010 62772784

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: Reduce memory for each CPU in yambo 4.0.1 rev89

Post by Daniele Varsano » Sun Oct 11, 2015 4:42 pm

Dear Wen-Dong Cao,
A possible problem can be that the parallelization is not complete:
Instead of:
X_q_0_CPU= "8"
X_q_0_ROLEs= "k"

please use:
X_q_0_CPU= "8 1 1"
X_q_0_ROLEs= "k c v"

and so on. I can see that you are interested in one finite q only, so again:
X_finite_q_CPU= "2,4" # [PARALLEL] CPUs for each role
X_finite_q_ROLEs= "c,v" # [PARALLEL] CPUs roles (q,k,c,v)
substitute with:
X_finite_q_CPU= "1 1 2 4" # [PARALLEL] CPUs for each role
X_finite_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)
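Note that in each environment the product of the assigned CPUs should match the total number of MPI tasks of the run; with the 8 tasks of your log, the two settings above give:

8 = 8 x 1 x 1       (X_q_0:      k c v)
8 = 1 x 1 x 2 x 4   (X_finite_q: q k c v)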

Anyway, as you said in your first post, you have a very large number of k-points (9947) in the IBZ, which turns into 112896 in the entire BZ, so memory may well be an issue since you are using only 8 CPUs. Do you have any error messages in any of the 8 LOG files, or in the queue output?
The q momenta have been calculated in the setup and read in the present run:

Code:

[RD./SAVE//ndb.kindx]--
so it looks like it stops when it starts to calculate the dipoles for the non-interacting response function. You can have a look in the log files to see how the tasks are shared among the processors, if the run gets that far, or you can also lower the number of G-vectors via FFTGvecs and the number of empty bands in the response, in order to check whether it is a memory issue.
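Just as an illustration of the kind of reduction meant here (the values below are arbitrary and will of course affect the converged spectrum, so they are only meant as a memory test):

FFTGvecs= 1500 RL     # [FFT] Plane-waves (lowered from 3000)
% BndsRnXd
  1 | 20 |            # [Xd] Polarization function bands (lowered from 24)
%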

Best,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

caowd
Posts: 22
Joined: Mon Jan 20, 2014 4:13 am

Re: Reduce memory for each CPU in yambo 4.0.1 rev89

Post by caowd » Mon Oct 12, 2015 6:55 am

Dear Daniele

Thank you very much for your quick response.

I will use the complete parallelization from now on.
And I am sure that there is NO ERROR message in any of the files (log, report, output, ...).

After reducing FFTGvecs, the memory usage indeed drops.
So maybe the problem is that the number of nodes is not enough for my job. I will try 16 or 32 nodes.
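The launch line itself then only changes in the task count, e.g., assuming a standard mpirun launcher:

mpirun -np <Ntasks> yambo -F yambo.in

with the parallel CPU variables rescaled so that their products match the new number of tasks.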

Thank you again for your help.
Wen-Dong Cao
Candidate for Ph.D
Department of Physics
Group of Condensed Material Theory
Tsinghua University
Beijing P.R.China
+86 010 62772784
