memory issue in G0W0 run

Concerns issues with computing quasiparticle corrections to the DFT eigenvalues - i.e., the self-energy within the GW approximation (-g n), or considering the Hartree-Fock exchange only (-x)

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano

BMarco
Posts: 1
Joined: Tue Nov 22, 2011 9:11 pm

memory issue in G0W0 run

Post by BMarco » Tue Nov 22, 2011 11:10 pm

Hi Yambo community,
I'm trying to do a G0W0 calculation on a graphene layer with 128 electrons (input file attached). I'm running on 6 processors (1 processor per node) and allocating 16 Gb of memory per processor. Just to give you a sense of the size of the calculation, a colleague has done a very similar calculation using 5 Gb of memory on the Cineca sp6 machine, on 4 processors in 60 hours.
Even though there should be enough memory, the calculation crashes in a few minutes saying:
<03m-01s> [08] Bare local and non-local Exchange-Correlation
[ERROR] STOP signal received while in :[08] Bare local and non-local Exchange-Correlation
[ERROR]Mem All. failed. Element WF require 0.00000 [Gb]

(l_* and r_* files also attached, together with compilation options used).

We checked that Yambo was only using 3-4% of the memory allocated on each node when it crashed.
I also tried decreasing the RL cutoff to ~10000: in that case the calculation gets to the next step, but then it crashes with a similar error:
[09] Dynamic Dielectric Matrix (PPA)
[ERROR]Mem All. failed. Element WF require 1.46268 [Gb]

Yambo otherwise runs fine on simple tests on our machine (we tried bulk Si RPA and the SiH4 tutorial), but we installed Yambo very recently and we're new to it, so there may still be a few things left to adjust, and I'm asking for your kind help with this.
Many thanks in advance.
Best,

Marco Bernardi
Ph.D. Candidate
Grossman Group at MIT (zeppola.mit.edu)
e-mail: bmarco@mit.edu
Phone: +1-617-258-0222

Daniele Varsano
Posts: 3838
Joined: Tue Mar 17, 2009 2:23 pm

Re: memory issue in G0W0 run

Post by Daniele Varsano » Wed Nov 23, 2011 12:22 pm

Dear Marco,
well, this error is quite disturbing:
[ERROR]Mem All. failed. Element WF require 0.00000 [Gb]
and we may need to reproduce your error in order to understand what is happening.

Anyway, looking at your report, I noticed that in your database you have 40075 G-vectors,
Electrons : 128.0000
WF G-vectors : 40075

and you are trying to use 45600 plane waves to build up the exchange self-energy.
Did you use the MaxGvec keyword when running the setup?
If so, try to rerun the setup using all the components.
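
The comparison above (components requested for the exchange versus G-vectors available in the database) can be scripted as a quick sanity check. This is only a minimal Python sketch and not part of Yambo: the function names are hypothetical, and it assumes the report contains a line of the form "WF G-vectors : N", as in the excerpt quoted above.

```python
import re


def wf_gvectors_in_report(report_text):
    """Return the integer from a 'WF G-vectors : N' line, or None if absent.

    Assumption: the report prints the count in exactly this form.
    """
    match = re.search(r"WF G-vectors\s*:\s*(\d+)", report_text)
    return int(match.group(1)) if match else None


def exceeds_available(requested_rl, report_text):
    """True if the requested RL components exceed the reported WF G-vectors."""
    available = wf_gvectors_in_report(report_text)
    if available is None:
        raise ValueError("no 'WF G-vectors' line found in the report text")
    return requested_rl > available


# Values taken from this thread: 45600 requested, 40075 in the database.
report = """Electrons : 128.0000
WF G-vectors : 40075"""
print(exceeds_available(45600, report))
```

If the check returns True, the request is larger than what the database holds, which is the situation described above.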
Next, even if it is not related to your error, I can see that you linked to the netcdf libraries
in your configure (which is recommended), but something failed (maybe they were compiled
with a different compiler, or the module was not loaded?), as your databases are
named s.db1 etc. instead of ns.db1 etc. When configuring yambo, check that the box
for the netcdf option is marked.

Cheers,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

ljzhou86
Posts: 85
Joined: Fri May 03, 2013 10:20 am

Re: memory issue in G0W0 run

Post by ljzhou86 » Mon Dec 08, 2014 12:49 am

Dear developers:

I also have a similar question about the memory issue when I do a GW calculation, with the input generated by "yambo -p p -g n". The related information is as follows; please help me to resolve it. Thanks in advance!

yambo.in
#
# GPL Version 3.4.1 Revision 3187
# http://www.yambo-code.org
#
gw0 # [R GW] GoWo Quasiparticle energy levels
ppa # [R Xp] Plasmon Pole Approximation
HF_and_locXC # [R XX] Hartree-Fock Self-energy and Vxc
em1d # [R Xd] Dynamical Inverse Dielectric Matrix
EXXRLvcs= 32729 RL # [XX] Exchange RL components
Chimod= "300" # [X] IP/Hartree/ALDA/LRC/BSfxc
% BndsRnXp
1 | 250 | # [Xp] Polarization function bands
%
NGsBlkXp= 1 RL # [Xp] Response block size
% LongDrXp
1.000000 | 0.000000 | 0.000000 | # [Xp] [cc] Electric Field
%
PPAPntXp= 27.21138 eV # [Xp] PPA imaginary energy
% GbndRnge
1 | 250 | # [GW] G[W] bands range
%
GDamping= 0.10000 eV # [GW] G[W] damping
dScStep= 0.10000 eV # [GW] Energy step to evalute Z factors
DysSolver= "n" # [GW] Dyson Equation solver (`n`,`s`,`g`)
%QPkrange # [GW] QP generalized Kpoint/Band indices
1| 72| 1|250|
%
%QPerange # [GW] QP generalized Kpoint/Energy indices
1| 72| 0.0|-1.0|
%

l_qp_em1d_ppa_HF_and_locXC_gw0 is as follows:

___ __ _____ __ __ _____ _____
| Y || _ || Y || _ \ | _ |
| | ||. | ||. ||. | / |. | |
\_ _/ |. _ ||.\_/ ||. _ \ |. | |
|: | |: | ||: | ||: | \|: | |
|::| |:.|:.||:.|:.||::. /|::. |
`--" `-- --"`-- --"`-----" `-----"

<---> [01] Files & I/O Directories
<---> [02] CORE Variables Setup
<---> [02.01] Unit cells
<---> [02.02] Symmetries
<---> [02.03] RL shells
<---> [02.04] K-grid lattice
<---> [02.05] Energies [ev] & Occupations
<---> [03] Transferred momenta grid


<---> [04] Bare local and non-local Exchange-Correlation
[ERROR] STOP signal received while in :[04] Bare local and non-local Exchange-Correlation
[ERROR]Mem All. failed. Element WF require 0.00000 [Gb]




r-05_qp_em1d_ppa_HF_and_locXC_gw0


YAMBO@jj15c29 x 096 CPUs * 12/07/2014 15:49

[01] Files & I/O Directories
============================

CORE databases in .
Additional I/O in .
Communications in .
Input file is yambo.in
Report file is ./r-05_qp_em1d_ppa_HF_and_locXC_gw0
Job string is 05_qp
Log file is ./l-05_qp_em1d_ppa_HF_and_locXC_gw0

[RD./SAVE//ns.db1]------------------------------------------
Bands : 250
K-points : 72
G-vectors [RL space]: 260023
Components [wavefunctions]: 32736
Symmetries [spatial+T-rev]: 2
Spinor components : 1
Spin polarizations : 1
Temperature [ev]: 0.000000
Electrons : 220.0000
WF G-vectors : 43147
Max atoms/species : 44
No. of atom species : 1
Magnetic symmetries : no
- S/N 009318 --------------------------- v.03.04.01 r.3187 -

[02] CORE Variables Setup
=========================


[02.01] Unit cells
==================

Unit cell is Unknown

... containing 44P atoms

... with scaling factors [a.u.]: 52.35866 6.23515 47.24317

Direct Lattice(DL) unit cell [iru / cc(a.u.)]
A1 = 1.000000 0.000000 0.000000 52.35866 0.000000 0.000000
A2 = 0.000000 1.000000 0.000000 0.000000 6.235154 47.24317
A3 = 0.000000 0.000000 1.000000 0.000000 0.000000 47.24317

DL volume [au]:0.1542E+5

Reciprocal Lattice(RL) unit cell [iku / cc]
B1 = 1.000000 0.000000 0.000000 0.120003 0.000000 0.000000
B1 = 0.000000 1.000000 0.000000 0.000000 1.007704 0.000000
B1 = 0.000000 0.000000 1.000000 0.000000 0.000000 0.132997


[02.02] Symmetries
==================

DL (S)ymmetries [cc]
[S1] 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000

[SYMs] Time-reversal derived K-space symmetries: 2 2
[SYMs] Spatial inversion 2 is NOT a symmetry
[SYMs] Group table built correctly

[02.03] RL shells
=================

Shells, format: [S#] G_RL(mHa)

[S4806]:32729(0.1262E+5) [S4805]:32721(0.1262E+5) [S4804]:32713(0.1262E+5) [S4803]:32705(0.1262E+5)
[S4802]:32701(0.1261E+5) [S4801]:32693(0.1261E+5) [S4800]:32685(0.1261E+5) [S4799]:32681(0.1261E+5)
[S4798]:32673(0.1261E+5) [S4797]:32665(0.1261E+5) [S4796]:32657(0.1261E+5) [S4795]:32649(0.1261E+5)
[S4794]:32641(0.1260E+5) [S4793]:32633(0.1260E+5) [S4792]:32629(0.1260E+5) [S4791]:32621(0.1260E+5)
[S4790]:32613(0.1259E+5) [S4789]:32605(0.1259E+5) [S4788]:32597(0.1259E+5) [S4787]:32585(0.1259E+5)
[S4786]:32577(0.1259E+5) [S4785]:32569(0.1258E+5) [S4784]:32561(0.1258E+5) [S4783]:32553(0.1258E+5)
[S4782]:32549(0.1257E+5) [S4781]:32541(0.1257E+5) [S4780]:32533(0.1257E+5) [S4779]:32525(0.1257E+5)
...
[S12]:33( 79.59655) [S11]:31( 73.64708) [S10]:27( 64.80301) [S9]:25( 64.17758)
[S8]:21( 42.57657) [S7]:17( 37.64540) [S6]:13( 35.37624) [S5]:11( 28.80134)
[S4]:9( 16.04440) [S3]:5( 8.844060) [S2]:3( 7.200335) [S1]:1( 0.000000)

[02.04] K-grid lattice
======================

Compatible Grid is 2D
B1 [rlu]= 0.07143 0.00000 0.00000
B2 = 0.000000 -0.100000 0.000000
Grid dimensions : 14 10
K lattice UC volume [au]:0.1149E-3

[02.05] Energies [ev] & Occupations
===================================

Fermi Level [ev]:-0.627056
Electronic Temp. [ev K]: 0.00 0.00
Bosonic Temp. [ev K]: 0.00 0.00
El. density [cm-3]: 0.963E+23
States summary : Full Metallic Empty
0001-0110 0111-0250
Indirect Gaps [ev]: 0.720897 1.764378
Direct Gaps [ev]: 0.720897 2.224995
X BZ K-points : 140

Energy unit is electronVolt [eV]

*X* K [1] : 0.000000 0.000000 0.000000 (cc ) * Comp.s 32297 * weight 0.0071
0.000000 0.000000 0.000000 (iku)
E -15.31026 -15.14116 -15.13344 -14.91702 -14.78473 -14.77657 -14.63885 -14.62416
E -14.39564 -14.38989 -13.87777 -13.82162 -13.79450 -13.73244 -13.16560 -13.16038
E -12.72900 -12.70233 -12.65262 -12.64463 -12.46140 -11.91128 -10.02111 -9.18990
E -9.151641 -9.096883 -9.004974 -8.998000 -8.966287 -8.961825 -8.414806 -8.396224
Dr. Zhou Liu-Jiang
Fujian Institute of Research on the Structure of Matter
Chinese Academy of Sciences
Fuzhou, Fujian, 350002

Daniele Varsano
Posts: 3838
Joined: Tue Mar 17, 2009 2:23 pm

Re: memory issue in G0W0 run

Post by Daniele Varsano » Mon Dec 08, 2014 7:39 am

Dear Zhou,

first please note that
Chimod= "300" # [X] IP/Hartree/ALDA/LRC/BSfxc
this does not make much sense; use the keyword Hartree if you want to calculate the screening in the RPA approximation.
Next, why do you need QP corrections for all these bands and k-points?
%QPkrange # [GW] QP generalized Kpoint/Band indices
1| 72| 1|250|
%
This is a very big number: 18000 GW calculations!!! If you want to look at the band structure, you only need the bands near the Fermi energy, or a few conduction bands. If you need them for a Bethe-Salpeter calculation, you do not need such a big number, and if needed you can interpolate them.
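
The figure of 18000 is just the number of (k-point, band) pairs selected by the QPkrange line, 72 x 250. A minimal Python sketch of that count is given here; the helper name is hypothetical, and the reduced range in the second call is purely illustrative (bands chosen around the 110/111 occupied/empty boundary shown in the report), not a recommendation.

```python
def count_qp_states(qpkrange_line):
    """Count (k-point, band) pairs in a 'k_first| k_last| b_first| b_last|' line."""
    fields = [int(f) for f in qpkrange_line.split("|") if f.strip()]
    if len(fields) != 4:
        raise ValueError("expected four '|'-separated integers")
    k_first, k_last, b_first, b_last = fields
    return (k_last - k_first + 1) * (b_last - b_first + 1)


# Range from the posted input: all 72 k-points and all 250 bands.
print(count_qp_states("1| 72| 1|250|"))

# Hypothetical reduced range: 12 bands around the gap, same k-points.
print(count_qp_states("1| 72| 105|116|"))
```

The first call gives 18000 and the second 864, which shows how quickly the workload drops once the band range is restricted.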
Next, consider raising the value of NGsBlkXp= 1 RL, as in this way you are neglecting local field effects.
Anyway, your memory problem happens in the exchange part: reduce QPkrange to the bands you really need. If necessary, you can also think about lowering the number of G-vectors in the exchange (EXXRLvcs) and setting the FFTGvecs variable as well (to the same value as EXXRLvcs). In any case, you need to check that you are not losing accuracy when lowering these two variables.

Best,
Daniele

ljzhou86
Posts: 85
Joined: Fri May 03, 2013 10:20 am

Re: memory issue in G0W0 run

Post by ljzhou86 » Mon Dec 08, 2014 11:36 am

Dear Daniele:
I have revised yambo.in as you told me, including Chimod, NGsBlkXp, EXXRLvcs and FFTGvecs; however, the job inexplicably stopped as follows:


<---> [01] Files & I/O Directories
<---> [02] CORE Variables Setup
<---> [02.01] Unit cells
<---> [02.02] Symmetries
<---> [02.03] RL shells
<---> [02.04] K-grid lattice
<---> [02.05] Energies [ev] & Occupations
<---> [03] Transferred momenta grid
<---> [04] Bare local and non-local Exchange-Correlation
<05s> [M 3.811 Gb] Alloc WF ( 3.780)
<07s> [FFT-HF/Rho] Mesh size: 90 9 75
l-06_qp_em1d_ppa_HF_and_locXC_gw0 lines 1-20/20 (END)

When I check the error output generated by my PBS job submission, some strange errors appear as follows:

PSI: Found batch system of PBS flavour. Ignoring any choices of nodes or hosts.
PSIlogger: Child with rank 0 exited on signal 9: Killed
PSIlogger: Child with rank 3 exited on signal 9: Killed
PSIlogger: Child with rank 12 exited on signal 9: Killed

I hope this helps your analysis. Thanks in advance.

Daniele Varsano
Posts: 3838
Joined: Tue Mar 17, 2009 2:23 pm

Re: memory issue in G0W0 run

Post by Daniele Varsano » Mon Dec 08, 2014 12:07 pm

Dear Zhou,
unfortunately I cannot help you with that, as it is just a message from the PBS system, and it does not say anything about what is going wrong.
Please consider that the most problematic variable in your input was QPkrange. In any case, when facing problems, please post the input, output and report.
Instead of copying and pasting them, consider attaching them as files, as that is easier to read.
Best,
Daniele

ljzhou86
Posts: 85
Joined: Fri May 03, 2013 10:20 am

Re: memory issue in G0W0 run

Post by ljzhou86 » Thu Dec 11, 2014 6:21 pm

Dear Dr. Daniele Varsano

I have revised the input file as you suggested. However, the problem persists. It seems that the job cannot get past the step [04] Bare local and non-local Exchange-Correlation, showing "[ERROR] STOP signal received while in :[04] Bare local and non-local Exchange-Correlation
[ERROR]Mem All. failed. Element WF require 0.00000 [Gb]".

Note that the system studied consists of 44 atoms. I would like to know whether 10 nodes (160 CPU cores) are not enough for a GW+BSE calculation; see the job submission script (sub.sh).
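
One rough way to reason about whether the nodes are large enough is to multiply the wavefunction allocation reported in the log by the number of processes sharing a node. The Python sketch below does only that arithmetic. It rests on assumptions that the thread does not confirm: that every MPI process holds its own copy of the allocation printed as "Alloc WF ( 3.780)" in the earlier log, that the same figure applies to the 160-core run, and that processes are spread evenly over the nodes. The function name is hypothetical.

```python
def wf_memory_per_node_gb(wf_alloc_per_process_gb, total_processes, nodes):
    """Estimate WF memory per node, assuming one WF copy per process.

    Assumption: processes are distributed evenly across the nodes.
    """
    if nodes <= 0 or total_processes % nodes != 0:
        raise ValueError("processes must divide evenly over a positive node count")
    processes_per_node = total_processes // nodes
    return processes_per_node * wf_alloc_per_process_gb


# 3.780 Gb per process (from the log), 160 cores on 10 nodes (from this post).
estimate = wf_memory_per_node_gb(3.780, 160, 10)
print(f"{estimate:.2f} Gb per node for wavefunctions alone")
```

Under these assumptions the estimate is about 60 Gb per node for the wavefunctions alone, a number to compare against the physical memory of the nodes, which is not stated in this thread.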

Please see the attachment to help find out how to resolve the problem. Thanks in advance.

Daniele Varsano
Posts: 3838
Joined: Tue Mar 17, 2009 2:23 pm

Re: memory issue in G0W0 run

Post by Daniele Varsano » Mon Dec 15, 2014 1:36 pm

Dear Zhou,
I cannot see the attached file.
Please attach all the information that can be useful to spot the problem (input, report, standard output, script, etc.).

Best,
Daniele
