COHSEX stopping without error messages

Concerns issues with computing quasiparticle corrections to the DFT eigenvalues - i.e., the self-energy within the GW approximation (-g n), or considering the Hartree-Fock exchange only (-x)

Post by elena.mol » Tue Oct 11, 2011 10:02 am

Dear all,
I have run into problems running COHSEX. They are a bit different from those mentioned by someone else in a previous post (viewtopic.php?f=14&t=323), so I'm not sure the cause of the problem is the same.

As in that case, my system is a molecule with only 1 k-point, and I'm using yambo-3.2.4-rev.17, internal rev 855.

I'm attaching a tar archive containing:
the input file I use for COHSEX
the log and report files
l_dbs
report and setup files from the yambo installation
LIST_log = list of all the files in the directory

What always happens (in different runs, also when varying input parameters such as the energy cutoff or the number of bands) is that the run stops at this point (I'm reporting the last lines of the log file):



<03d-02h-17m-49s> P01: G0W0 COHSEX |################### | [095%] 02d-15h-56m-30s(E) 02d-19h-18m-25s(X)
<03d-02h-19m-11s> P02: G0W0 COHSEX |################### | [095%] 02d-15h-57m-53s(E) 02d-19h-19m-52s(X)
<03d-05h-38m-27s> P01: G0W0 COHSEX |####################| [100%] 02d-19h-17m-08s(E) 02d-19h-17m-08s(X)
<03d-05h-40m-40s> [M 5.350 Gb] Free ISC-GAMP ( 1.194)
<03d-05h-40m-40s> [M 4.158 Gb] Free X ( 1.192)
<03d-05h-40m-40s> [M 1.775 Gb] Free PPaR ( 2.384)


without indicating any “ERR” in the report file, which ends like this:


[09] Dyson equation: Newton solver
==================================

[Newton] Sc step [ev]: 0.100000
[Newton] Sc steps : 1
[Newton] SC iterations :0


[09.01] G0W0 : COHSEX
=====================





The situation is the same whether I run the calculation on a local cluster or on CINECA SP6, with the same yambo version (on CINECA with netcdf, on our cluster without netcdf). The only difference is that on CINECA I get an empty report file (due to different file handling, input/output procedures etc. on the two machines, I guess) and the system reports “Segmentation fault”; the point where the run stops, as seen from the log file, is the same.


I'm running other calculations (TDDFT and BSE) on the same system, from the same /SAVE directory, without any errors.


Does anyone have ideas on how to solve this problem?

Thanks a lot in advance

cheers
Elena


Elena Molteni
Department of Physics
University of Milan
via Celoria, 16
I-20133, Milan, Italy
and European Theoretical Spectroscopy Facility
(ETSF) http://www.etsf.eu


PS: I know this is a big system, and that you suggest posting “small” examples, but at the moment I have some problems with abinit (which I normally use to create the KSS files, and from them the /SAVE directory), so creating a “small” test system right now would not be easy or quick. (For the yambo calculations I'm using the /SAVE directory I created some time ago, before the abinit problems started.)

Post by Daniele Varsano » Tue Oct 11, 2011 10:16 am

Dear Elena,
I suspect it is a memory issue. From the output you posted I can
see that during the run you allocate more than 5 Gb.
Try reducing some parameter of your calculation (for example FFTGvecs),
or raise the memory you request for your run (at CINECA you can do that, even if
it will cost more in terms of processors counted).
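
For what it's worth, the memory figure can be pulled out of the log automatically. The lines below are a minimal Python sketch, not part of yambo. It assumes log lines of the form shown in the first post ([M 5.350 Gb] Free ...), treats the bracketed number as the total memory yambo reports at that point, and returns the largest such value among the lines it is given. The name peak_memory_gb is made up for this example.

```python
import re

# Matches the "[M 5.350 Gb]" marker found in the yambo log lines quoted above.
MEM_MARKER = re.compile(r"\[M\s+(\d+(?:\.\d+)?)\s+Gb\]")

def peak_memory_gb(log_lines):
    """Return the largest '[M x Gb]' value found in log_lines, or None."""
    peak = None
    for line in log_lines:
        match = MEM_MARKER.search(line)
        if match:
            value = float(match.group(1))
            if peak is None or value > peak:
                peak = value
    return peak

# The three memory lines from the end of the log in the first post.
tail = [
    "<03d-05h-40m-40s> [M 5.350 Gb] Free ISC-GAMP ( 1.194)",
    "<03d-05h-40m-40s> [M 4.158 Gb] Free X ( 1.192)",
    "<03d-05h-40m-40s> [M 1.775 Gb] Free PPaR ( 2.384)",
]
print(peak_memory_gb(tail))  # prints 5.35
```

This only sees what is printed in the log, so the true peak may be higher than the value it returns; run it over the whole log rather than just the tail.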

Hope it helps,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Post by elena.mol » Wed Oct 12, 2011 10:42 am

Hi,
thanks for the suggestion.
I had the same idea at first,
but it doesn't seem to be a memory issue.

In fact, yesterday I ran the calculation on CINECA requesting 20 Gb per CPU. The maximum memory used, as it appears in the l... file, was 6.544 Gb, yet the calculation stopped with a segmentation fault message and no message like "memory exceeded" (which is what usually appears in the file /usr/spool/mail/username when it is a memory issue).

However, as an additional test, I'm now running the same calculation with reduced GbndRnge and QPkrange which, if I understand correctly, are roughly speaking the band ranges on which the GW correction is calculated(?)
What is a reasonable value for this band range? The same band range I'm considering (and trying to converge) for the TDLDA and BSE calculations?


thanks again
cheers
Elena

Post by Daniele Varsano » Wed Oct 12, 2011 12:28 pm

Dear Elena,
You wrote:
"However, as an additional test, now i'm running the same calculation reducing GbndRnge and QPkrange, which, if i understand well, are, roughly speaking, the band ranges on which to calculate the GW correction(?)
What is a reasonable value for this band range? the same band range i'm considering (and trying to converge) for TDLDA and BSE calculations?"

Note that GbndRnge sets the bands included in the summations when calculating the GW corrections. The bands for which the corrections are calculated
are controlled by the general index QPkrange. In any case, lowering GbndRnge should lower the memory needed.
The right value of GbndRnge is totally system dependent, and you should converge it.
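
In practice, converging this parameter means repeating the run with a larger upper band each time and watching when the quantity you care about stops changing. Here is a minimal Python sketch of that check. It assumes you have already collected (upper band, energy) pairs by hand from your own runs; the function name and the tolerance are hypothetical, and nothing here calls yambo.

```python
def converged_band(runs, tol):
    """Find the first band count at which the energy has stopped changing.

    runs: list of (upper_band, energy_eV) pairs, sorted by upper_band.
    tol:  largest change in eV between consecutive runs that counts
          as converged.

    Returns the first upper_band whose energy differs from the previous
    run by less than tol, or None if no consecutive pair is that close.
    """
    for (_, prev_energy), (band, energy) in zip(runs, runs[1:]):
        if abs(energy - prev_energy) < tol:
            return band
    return None
```

A single small difference does not prove convergence, so it is worth checking that the later runs stay within the tolerance too.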

Best,

Daniele

Post by elena.mol » Fri Oct 14, 2011 1:43 pm

Dear Daniele,
thanks for the suggestion: yes, it makes sense that I have to converge this too.

Apart from this, in the meantime I ran a COHSEX calculation with only 2 bands (the HOMO and the LUMO of my molecule) in both QPkrange and GbndRnge. Of course it doesn't have much physical meaning; it is just a test. The result was the same: the run stopped, again at the same place:

<07h-57m-03s> G0W0 COHSEX | | [000%] --(E) --(X)
<07h-57m-41s> P01: G0W0 COHSEX |########## | [050%] 37s(E) 01m-15s(X)
<07h-57m-50s> P01: G0W0 COHSEX |####################| [100%] 47s(E) 47s(X)
<07h-57m-51s> [M 4.171 Gb] Free ISC-GAMP ( 1.194)
<07h-57m-51s> [M 2.979 Gb] Free X ( 1.192)
<07h-57m-51s> [M 0.595 Gb] Free PPaR ( 2.384)


without error messages in the log or report files and no complaints about memory, but with a segmentation fault message in the *.err file of the submitted job (at CINECA).

It doesn't write any o..... files, but it writes db.HF_and_locXC and db.QP.

There may be some trivial error in my input file, so I attach it again (the test one with only 2 bands for GbndRnge and QPkrange), in case someone can find the "bug" in my input or has had similar problems.

By the way: maybe the huge number of "Polarization function bands", BndsRnXs, is a problem? If I understood correctly, though, this should be the number of bands used for calculating the screening, a calculation that was already done before the current run and whose result is read by the current run. In fact I had also tried to reduce this number, but in that case yambo gave an error on this keyword (it found it was different from what is in the db.em1s file, if I remember well, which makes sense).

thanks again


best
Elena


Post by elena.mol » Fri Oct 14, 2011 3:01 pm

Sorry, I forgot to attach the input file.
Here it is.


cheers
Elena

Post by Daniele Varsano » Sun Oct 16, 2011 9:49 pm

Dear Elena,
I did not find any problem in your input.
Moreover, keep in mind that COHSEX calculations do not need unoccupied bands.
You wrote: "It doesn't write any o..... files, but it writes db.HF_and_locXC and db.QP."
This is strange: if db.QP is written, the calculation should have finished,
as the standard output also suggests.
What about your report file? Are the energies written there?
It looks like something is not working properly: I just tried a small calculation and
it stops after calculating the G0W0 COHSEX. I will have a look as soon as possible.

Daniele

Post by Daniele Varsano » Mon Oct 17, 2011 9:44 am

Dear Elena,
what version of Yambo are you using?
I suggest you update to the latest repository version (check the webpage for how to do it!)
and also link with the netcdf library. I did not try on the CINECA SP6, but on another machine I did not get any problem.

Cheers,

Daniele

Post by elena.mol » Tue Oct 18, 2011 3:09 pm

Dear Daniele,
the version of yambo I'm using is "yambo-3.2.4-r.855" (downloaded as a .tar from http://qe-forge.org/frs/?group_id=34, where it is listed as yambo-3.2.4-rev.17.tar.gz).

Regarding SVN: on our local cluster svn seems to be already installed, so, following the instructions on http://www.yambo-code.org/download.php, I just registered on QE-FORGE and then submitted a "request to join" to the yambo project. As soon as I get the approval mail, I will do:
svn checkout svn+ssh://myqeforgeusername@qe-forge.org/svn/yambo .
I tried this now and it doesn't work and/or complains about permissions, so I guess that is because the procedure is not complete yet.

By the way (since I've never used svn before): how can I check which is the most recent yambo version on svn (to compare it with the ones available as tarballs)?
If I connect to http://qe-forge.org/projects/yambo and go to "yambo" --> "files", I see the same versions I see when I choose to download the tarballs.

Regarding NETCDF: on our local cluster I didn't succeed in installing yambo linked to netcdf (so I just installed it without netcdf),
but I have the same problem with COHSEX on CINECA, where I have installed yambo with netcdf.

And finally, regarding the report file, I'm attaching an example. It is not from the "2 band" calculation (I ran that on CINECA, where the report file is not written unless the run finishes regularly) but from a different one; the pathology, as seen from the log file, was the same, so it's equivalent.

many thanks again
Best

Elena

Post by Daniele Varsano » Tue Oct 18, 2011 5:02 pm

Dear Elena,
the svn version is the latest version and is constantly updated.
In order to have access to the svn, you have to upload your
public ssh key to qe-forge, as explained in the FAQ.
On the qe-forge web site you should find the instructions; right now I do not remember
the exact procedure. The new version should solve your problems, I hope.
I'm sure that netcdf is not related to your problem; it was just a suggestion. Even if it is not
installed on your local cluster, you can freely download and install it.
Cheers,

Daniele