Parallelization incompatibility between QE and YAMBO

This board collects problems arising with old releases of Yambo (< 5.0): parallelization strategy, performance issues, and other technical aspects.



Parallelization incompatibility between QE and YAMBO

Post by luca.montana » Tue Mar 14, 2017 7:50 am

Dear Developers,

I would like to make you aware of a parallelization incompatibility between QE and Yambo 4.2.1.

Whenever I parallelize over bands in QE in this form:

Code:

 ../pw.x  -nk 8 nb 4   ... 
I later get the following error from Yambo 4.2.1 in my full frequency-dependent GW calculation:

Code:

double free or corruption (out): 0x0000000003749030 ***


However, when I avoid band parallelization in QE, everything is fine.
Maybe this is something you may want to fix in a future release.


Best wishes
LUCA
Luca Montana
PhD student
University of York, UK


Re: Parallelization incompatibility between QE and YAMBO

Post by andrea.ferretti » Tue Mar 14, 2017 11:19 am

Dear Luca,

thanks for pointing this out.
The only coupling between yambo and QE is through the data files produced by p2y, which in turn reads the data dumped by QE.
If such data depend on the actual parallelism of espresso, the problem is more likely there (when using the flag wf_collect in pw, no dependency on the
QE parallelism should be present).
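
For instance, the relevant switch in the pw.x input would look like this (a minimal sketch; all other variables are system-specific and omitted):

Code:

 &CONTROL
    calculation = 'scf'
    wf_collect  = .true.   ! gather wavefunctions into a layout independent of the parallelism
 /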

Nevertheless, if you can provide a test case showing the problem I'll try to have a look.

thanks
Andrea

BTW: which version of QE are you running?
In your command-line arguments you probably meant "-nb 4", right?
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it


Re: Parallelization incompatibility between QE and YAMBO

Post by luca.montana » Tue Mar 14, 2017 12:51 pm

Dear Andrea,

many thanks for your reply.

Yes, I meant -nb 4.

You can download the LiF.zip file from my recent post here: viewtopic.php?f=13&t=1298

If you run the scf with the -nk 8 -nb 4 options and then the GW calculation, you should see the above error.
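
For reference, the sequence I run is roughly the following (a sketch; file names and process counts are only illustrative, and the prefix is LiF as in the archive):

Code:

 mpirun -np 32 pw.x -nk 8 -nb 4 -inp scf.in > scf.out   # scf with 8 k-point pools and 4 band groups
 cd LiF.save && p2y                                     # convert the QE data for yambo
 yambo -F gw.in                                         # full frequency-dependent GW run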

I am using QE v5.3.0.

Best wishes
Luca
Luca Montana
PhD student
University of York, UK


Re: Parallelization incompatibility between QE and YAMBO

Post by andrea.ferretti » Tue Mar 14, 2017 1:16 pm

Hi Luca,

thanks, I'll have a look.
Concerning the band parallelism, could you check whether the problem persists with qe-6.0 or 6.1? (Band parallelism is a feature still undergoing changes in QE/pw,
and we need to make sure the problem is still there.)

Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it


Re: Parallelization incompatibility between QE and YAMBO

Post by luca.montana » Tue Mar 14, 2017 1:55 pm

Dear Andrea,

I checked with QE 6.0 and I still get the error.

Best
Luca
Luca Montana
PhD student
University of York, UK


Re: Parallelization incompatibility between QE and YAMBO

Post by andrea.ferretti » Thu Mar 23, 2017 10:19 am

Dear Luca,

I eventually managed to look into the problem.
It turned out to be a bug in QE (PW/src/pw_restart.f90, write_gk subroutine), which does not correctly write the k+G maps for each k-point
when band parallelism is used (these maps are all identically zero instead of containing the indices of the relevant G vectors).

QE 6.0 and 6.1 are both affected. The problem shows up with band parallelism only, but it is harmless for QE itself, since the above maps are recalculated rather than read from file during a restart.

For these versions there is not much to do on the yambo side (the missing data are essential). If you have access to the QE sources and are willing to recompile, here is a fix:

change the lines around line 620 in qe/PW/src/pw_restart.f90
from

Code:

         CALL mp_sum( itmp, inter_pool_comm )
         CALL mp_sum( itmp, intra_pool_comm )
to

Code:

         CALL mp_sum( itmp, intra_bgrp_comm )
         CALL mp_sum( itmp, inter_pool_comm )

The current development version of QE (see the QEF GitHub mirror) does not show the problem (for a reason I have not traced).
Note, however, that in order to use it with the current version of yambo one needs to add -D__OLDXML to IFLAGS/DFLAGS in make.inc
(by default it adopts a new data layout that is not yet supported in yambo; by the time of the next QE release there will probably be a yambo release supporting the new format, but it is not there at the moment).
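
For example, the edit amounts to something like this (a sketch: the flags already present in your make.inc are machine-dependent, and the ones shown here are placeholders):

Code:

 # in qe/make.inc, append -D__OLDXML to the preprocessor flags
 DFLAGS = -D__MPI -D__FFTW -D__OLDXML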

take care
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it


Re: Parallelization incompatibility between QE and YAMBO

Post by andrea.ferretti » Thu Mar 23, 2017 10:40 am

Hi again,

the bug described above is present when band groups are used in pw.
In practice, I think this does not have a large impact at the moment, since for cases other than EXX such parallelism is not very effective
(in most cases the time is dominated by the dense linear algebra involved in the diagonalization).
Considering that GW/BSE calculations are then far more expensive (meaning that the size of the DFT problem is not a big deal), one can
surely live without band parallelism for now (though thanks for pointing the problem out).

Concerning your input, I also noticed a few other issues which could impact performance (perhaps these were there just for the sake of showing
the problem; in that case ignore the following). A combined input sketch follows the list.

* force_symmorphic=.TRUE. , nosym=.FALSE. , noinv=.FALSE.
Only force_symmorphic=.TRUE. is actually needed.

* scf calculation + nbnd = 60
Always perform a scf calculation followed by a nscf calculation to compute a larger number of bands
(symmetry restrictions can be applied in the second step only).

* More subtle: in scf a looser convergence threshold is used for the empty states (they do not contribute to the charge density),
while in nscf full convergence is enforced for all states. In QE language: diago_full_acc = .false. by default in
scf, while .true. in nscf.
This can impact the quality of the results in a non-negligible way.

* diago_david_ndim=32: when using Davidson, recent tests seem to indicate that the smaller diago_david_ndim, the
better. The default value is 4; one can try diago_david_ndim=2 (it should take more iterations, but with smaller matrices).
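
Putting these points together, the nscf step would look roughly like this (a sketch: cutoffs, cell, and k-points are system-specific and omitted):

Code:

 &CONTROL
    calculation = 'nscf'
 /
 &SYSTEM
    force_symmorphic = .true.   ! the only symmetry-related flag actually needed
    nbnd             = 60       ! compute the empty bands here, not in the scf step
 /
 &ELECTRONS
    diago_full_acc   = .true.   ! already the default for nscf
    diago_david_ndim = 2        ! smaller Davidson workspace
 /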
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it


Re: Parallelization incompatibility between QE and YAMBO

Post by luca.montana » Thu Mar 23, 2017 12:11 pm

Dear Andrea,

many thanks for the detailed explanations and the bug fix.
andrea.ferretti wrote:
* force_symmorphic=.TRUE. , nosym=.FALSE. , noinv=.FALSE.
Only force_symmorphic=.TRUE. is actually needed.
That's true; this was an older input file, and I just wanted to check whether the results are consistent with and without symmetries.
andrea.ferretti wrote:
* scf calculation + nbnd = 60
Always perform a scf calculation followed by a nscf calculation to compute a larger number of bands
(symmetry restrictions can be applied in the second step only).
This was just a test, to be able to run the scf quickly and spot the problem from the crash.
What I generally do is optimize a relatively large number of bands (400-600) self-consistently, instead of running a single nscf cycle,
since I noticed that this can shift absorption peaks by up to 0.4 eV. It is of course system-dependent.
andrea.ferretti wrote:
More subtle: in scf a looser convergence threshold is used for the empty states (they do not contribute to the charge density),
while in nscf full convergence is enforced for all states. In QE language: diago_full_acc = .false. by default in scf, while .true. in nscf.
This can impact the quality of the results in a non-negligible way.
I generally set the threshold to 1.0e-08 in scf for a large number of bands (400-600). I think that should be good enough.
I further noticed that even for simple, small systems (made of hydrogen and oxygen) the kinetic-energy cutoff has to be set to rather high values, up to 100 Ry,
to keep errors in the positions of absorption peaks below 0.2 eV.
I see that in many publications people use energy cutoffs between 50 and 85 Ry, and this may not be very accurate.
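In practice I just scan the cutoff with a plain loop along these lines (file names here are only illustrative):

Code:

 # re-run the same scf input at increasing wavefunction cutoffs
 for ec in 50 70 85 100 120; do
     sed "s/ecutwfc *=.*/ecutwfc = ${ec}/" scf.in > scf_${ec}.in
     mpirun -np 16 pw.x -inp scf_${ec}.in > scf_${ec}.out
 done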
andrea.ferretti wrote:
diago_david_ndim=32: when using Davidson, recent tests seem to indicate that the smaller diago_david_ndim, the better. The default value is 4; one can try diago_david_ndim=2 (it should take more iterations, but with smaller matrices).
Actually, what I observed was that the higher diago_david_ndim, the faster the calculation, at the expense of more memory consumption.
I found that diago_david_ndim=32 or diago_david_ndim=16, depending on the number of bands in scf and the number of k-points, results in fast DFT performance.


Thanks a lot and best wishes
Luca
Luca Montana
PhD student
University of York, UK


Re: Parallelization incompatibility between QE and YAMBO

Post by andrea.ferretti » Fri Mar 24, 2017 4:30 pm

Hi Luca,
luca.montana wrote:
This was just a test, to be able to run the scf quickly and spot the problem from the crash.
What I generally do is optimize a relatively large number of bands (400-600) self-consistently, instead of running a single nscf cycle,
since I noticed that this can shift absorption peaks by up to 0.4 eV. It is of course system-dependent.
[...]
I generally set the threshold to 1.0e-08 in scf for a large number of bands (400-600). I think that should be good enough.
I am not sure this procedure is totally sound.
SCF is only meant to produce the charge density, while the empty bands simply depend on it (there is no need to involve them in the SCF part).

Some time ago Daniele Varsano and I did some checks on this, and SCF + NSCF (default variables) gave different results from
SCF (occupied + empty bands), e.g. in terms of eigenvalues.
Only when setting diago_full_acc = .true. in SCF (it is true by default in NSCF) did the two sets of data become consistent.

The point is that SCF lowers the accuracy of the iterative diagonalization for the empty states, which do not contribute to the density, to gain some speedup.
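
In practice the check amounted to comparing two runs along these lines (input names are placeholders):

Code:

 # run A: one-shot scf including the empty bands (with diago_full_acc set by hand)
 pw.x -inp scf_manybands.in > scf_manybands.out

 # run B: scf for the occupied states, then nscf with nbnd set;
 #        diago_full_acc defaults to .true. in the nscf step
 pw.x -inp scf.in  > scf.out
 pw.x -inp nscf.in > nscf.out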
luca.montana wrote:
I further noticed that even for simple, small systems (made of hydrogen and oxygen)
the kinetic-energy cutoff has to be set to rather high values, up to 100 Ry,
to keep errors in the positions of absorption peaks below 0.2 eV.
I see that in many publications people use energy cutoffs between 50 and 85 Ry, and this may not be very accurate.
I think this is a good point. What is the energy of the peaks you are looking at? (High-energy peaks may well require larger cutoffs.)
luca.montana wrote:
Actually, what I observed was that the higher diago_david_ndim, the faster the calculation, at the expense of more memory consumption.
I found that diago_david_ndim=32 or diago_david_ndim=16, depending on the number of bands in scf and the number of k-points, results in fast DFT performance.
Interesting! (I suspect this may have to do with the number of bands you require in input.)

Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it


Re: Parallelization incompatibility between QE and YAMBO

Post by luca.montana » Fri Mar 24, 2017 8:41 pm

Dear Andrea,
andrea.ferretti wrote:
Some time ago Daniele Varsano and I did some checks on this, and SCF + NSCF (default variables) gave different results from
SCF (occupied + empty bands), e.g. in terms of eigenvalues.
Only when setting diago_full_acc = .true. in SCF (it is true by default in NSCF) did the two sets of data become consistent.

The point is that SCF lowers the accuracy of the iterative diagonalization for the empty states, which do not contribute to the density, to gain some speedup.
In my opinion, since SCF self-consistently optimizes the eigenvalues of the empty bands when diago_full_acc = .true. is set, it should be more accurate than NSCF.
I think this should be checked; I am not sure whether the quality of the empty eigenvalues from SCF with diago_full_acc = .true. is the same as in the NSCF case.
It may be the same or differ from case to case.
andrea.ferretti wrote:
I think this is a good point. What is the energy of the peaks you are looking at? (High-energy peaks may well require larger cutoffs.)
The peak is at 8.7 eV; the system is an insulator.

I increasingly see publications that simply ignore the importance of the cutoff, as in this paper:
https://arxiv.org/pdf/1311.1384.pdf
where a cutoff of just 400 eV (about 29 Ry) is used!

andrea.ferretti wrote:
Interesting! (I suspect this may have to do with the number of bands you require in input.)
Yes, that is possible.


Further, I did some energy-only self-consistent calculations (in G and X) for different numbers of bands, at a relatively high dielectric cutoff (10 Ry), on a periodic system
(because of the correlation between the number of bands and the size of the dielectric matrix), and found sizeable oscillations (0.5 eV)
as a function of the number of bands; i.e. with 400 bands you get one converged result after 5 iterations, with 700 bands another converged result after 5 iterations, with 800
bands a converged result similar to the 400-band one, and with 1000 bands yet another converged result. It seems that sc-GW is very band-dependent.


Best wishes
Luca
Luca Montana
PhD student
University of York, UK
