WFbuffIO

Various technical topics, such as parallelism and efficiency, netCDF problems, and the Yambo code structure itself, are posted here.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan, Nicola Spallanzani

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

WFbuffIO

Post by robwarm » Thu Jul 02, 2015 10:33 am

Good morning,

I am using Yambo 4.0.1.89. In (or after) the PPA part of a G0W0 calculation, yambo would often crash, usually with a NetCDF error message (missing file or similar). I wasn't able to track down the problem; however, it disappeared when I enabled the WFbuffIO flag, which, as far as I know, is unfortunately not yet documented.
I am rather sure that there is no problem with Yambo, and relatively sure that NetCDF works fine. (That's why I am not attaching the files.) As I said, the errors are not very consistent, so they could be file-system issues.
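
For reference, yambo logical flags are switched on simply by writing the bare flag name in the input file; the line I added looks like this (the trailing comment is just my annotation):

    WFbuffIO     # [IO] buffer wavefunctions in memory instead of re-reading them from disk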

Three questions:
1. I run the calculations on a network file system (NFSv4). Is yambo sensitive to these rather slow file systems?
2. Are there any disadvantages/limitations to using the WFbuffIO flag, like increased memory usage?
3. What is the intended use case for WFbuffIO?

Best
Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

andrea.ferretti
Posts: 206
Joined: Fri Jan 31, 2014 11:13 am

Re: WFbuffIO

Post by andrea.ferretti » Sat Jul 04, 2015 11:50 pm

Dear Robert,

> 1. I run the calculations on a network file system (NFSv4). Is yambo sensitive to these rather slow file systems?

Well, yambo does some intense I/O from time to time, and a very slow file system can hurt performance, but I would not say
it can cause a failure.
Some common failures connected to netCDF files can be cured by specifying db fragmentation (it should be the default in v4.0.1, but
please check whether the -S option is accepted, starting from p2y onwards, and use it if it is).
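
Where the option is accepted, it is passed directly on the command line when the databases are generated; a sketch (version-dependent, so check your build first):

    p2y -S     # request fragmented databases (one file per k/q point)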

> 2. Are there any disadvantages/limitations to using the WFbuffIO flag, like increased memory usage?

The WFbuffIO keyword makes it possible to reduce wavefunction I/O from disk by
keeping more wavefunctions in memory and fetching them from there (through MPI communication if required, I think).
So, some extra memory usage for sure... I am not sure how much more (this option is still under testing).

> 3. What is the intended use case for WFbuffIO?

This keyword was first implemented to port and run yambo on a BlueGene/Q machine, which, as far as I understand,
has relatively slow I/O to disk but a relatively fast interconnect between nodes. Unfortunately, memory is also an issue on that machine.

take care
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Mon Jul 06, 2015 12:59 pm

Dear Andrea,

thank you for your reply, very informative.

p2y doesn't have a -S option (I tried, and it says invalid option).
The 'standard' db fragmentation in my example is one fragment per k-point. 'Standard' means without specifying anything.
I would run a system with 10 fragments on 12 MPI tasks, for example. Sometimes it works, sometimes it doesn't. It also depends on the parameters and the position of the moon. ;-) A student mentioned that a similar 24-k-point example would run on 8 MPI tasks but crash with 12...

Anyway, I guess I have to check the file system.

Have a good day.

Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Mon Jul 06, 2015 1:08 pm

Dear Robert,
> p2y doesn't have a -S option (I tried, and it says invalid option).

In 4.x it is not a valid option anymore; databases are fragmented by default. Actually, we are implementing some input-level control over fragmentation, since it can happen that the code produces too many files or, on the contrary, a few files with too many variables for netCDF. We are still testing, and we hope to distribute this feature soon.

> The 'standard' db fragmentation in my example is one fragment per k-point. 'Standard' means without specifying anything.

Yes, for the screening it should be one fragment for each q-point.

> I would run a system with 10 fragments on 12 MPI tasks, for example. Sometimes it works, sometimes it doesn't. It also depends on the parameters and the position of the moon. ;-) A student mentioned that a similar 24-k-point example would run on 8 MPI tasks but crash with 12...

OK, it can happen, the moon's position always plays a role!!! Anyway, if you want, you can post input/report/standard output so we can have a look at the different situations and understand the problem.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Mon Jul 06, 2015 2:10 pm

Dear Daniele,

I attached the files. I hope tar files are OK.
The first time I ran the yambo G0W0, it got stuck in [05] External QP corrections (X).
I cancelled the calculation and removed the LOG and output files and folders (the g0w0 folder and the r- report file).

Funnily enough, the [05] External QP corrections (X) and [06] External QP corrections (G) steps are then missing, and PPA becomes step [05] instead of [07].

Could that be related to my problem?

When using WFbuffIO, the [05] External QP corrections (X) and [06] External QP corrections (G) steps are not present at all (starting from a clean yambo folder). Hm, it has now crashed in the same way; the attached calculation crashed... weird.

Best Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Tue Jul 07, 2015 1:13 pm

Dear Robert,
most probably the problem arises from the parallelization strategy:
X_all_q_CPU and SE_CPU should contain powers of two.
Depending on your computer architecture, try to use 8 MPI tasks (if you have 12-core nodes) or 16, set the variables accordingly, and see if this fixes the problem.
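
For example, with 8 MPI tasks a setting along these lines (a sketch: the split over the roles is just one possibility, to be adapted to your system):

    X_all_q_CPU=   "2 2 2 1"    # response function: product 2*2*2*1 = 8 MPI tasks
    X_all_q_ROLEs= "q k c v"
    SE_CPU=        "2 1 4"      # self-energy: product 2*1*4 = 8 MPI tasks
    SE_ROLEs=      "q qp b"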

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Tue Jul 07, 2015 1:34 pm

Dear Daniele,

if I have a 12-core node and run a power-of-2 set-up, like you suggest, isn't that going to be inefficient?

For example, 2*4*1*1 = 8 would not use all the cores, and 4*4*1*1 = 16 would require more cores than I have. Or am I getting something wrong?
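
As I understand it, the constraint is that the product of the entries in each CPU string must equal the number of MPI tasks; a sketch of what I mean (launch line hypothetical):

    # 8 MPI tasks on a 12-core node: 4 cores sit idle
    mpirun -np 8 yambo -F yambo.in
    # where yambo.in contains  X_all_q_CPU= "2 2 2 1"   ->  2*2*2*1 = 8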

Best Robert

PS: I will still try it, of course.
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Tue Jul 07, 2015 1:39 pm

Dear Robert,
> ... isn't that going to be inefficient?

Well, it depends: of course you are not using all the CPUs, but you have more RAM per CPU; as usual, it depends on your needs.
Anyway, at the moment we are stuck with working in powers of 2 (the parallelization strategy was designed to work on BG architectures), so better not totally efficient than nothing.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Tue Jul 07, 2015 1:46 pm

I tried 8 MPI tasks with X_all_q_CPU= "2 2 2 1" and SE_CPU= "2 1 4". Same error.

Best Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Tue Jul 07, 2015 1:56 pm

Dear Robert,
"same error" on its own is not much to go on:
please post input/report/launch script/error.

What happens if you do not specify the parallel variables, i.e. if the code uses the defaults?
Of course, before rerunning yambo, delete the previously generated databases: ndb.pp_fragment_*
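
A sketch, assuming the fragments sit in the ./SAVE folder of the run directory (with a -J job identifier they would be in that job folder instead), and removing the ndb.pp header too so the screening is fully recomputed:

    # delete the previous screening database and its fragments
    rm SAVE/ndb.pp SAVE/ndb.pp_fragment_*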

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
