WFbuffIO

Various technical topics, such as parallelism and efficiency, netCDF problems, and the Yambo code structure itself, are posted here.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan, Nicola Spallanzani

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

WFbuffIO

Post by robwarm » Thu Jul 02, 2015 10:33 am

Good morning,

I am using Yambo 4.0.1.89. In (or after) the PPA part of a G0W0 calculation, yambo would often crash, usually with a NetCDF error message (missing file or similar). I wasn't able to track down the problem; however, it disappeared when I enabled the WFbuffIO flag, which, as far as I know, is unfortunately not yet documented.
I am rather sure that there is no problem with Yambo, and relatively sure that NetCDF works fine. (That's why I am not attaching the files.) As I said, the errors are not very consistent, so they could be file-system issues.
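
For reference, yambo logical flags are switched on simply by writing the bare flag name in the input file; the line I added looks like this (the trailing comment is just my annotation):

    WFbuffIO     # [IO] buffer wavefunctions in memory instead of re-reading them from disk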

Three questions:
1. I run the calculations on a network file system (NFSv4). Is yambo sensitive to these rather slow file systems?
2. Are there any disadvantages/limitations to using the WFbuffIO flag, like increased memory usage?
3. What is the intended use case for WFbuffIO?

Best
Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

andrea.ferretti
Posts: 206
Joined: Fri Jan 31, 2014 11:13 am

Re: WFbuffIO

Post by andrea.ferretti » Sat Jul 04, 2015 11:50 pm

Dear Robert,

> 1. I run the calculations on a network file system (NFSv4). Is yambo sensitive to these rather slow file systems?

Well, yambo does some intense I/O from time to time, and a very slow file system can hurt performance, but I would not say
it can cause a failure.
Some common failures connected to netCDF files can be cured by specifying db fragmentation (it should be the default in v4.0.1, but
please check whether the -S option is accepted, starting from p2y onwards, and use it if it is).
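
Where the option is accepted, it is passed directly on the command line when the databases are generated; a sketch (version-dependent, so check your build first):

    p2y -S     # request fragmented databases (one file per k/q point)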

> 2. Are there any disadvantages/limitations to using the WFbuffIO flag, like increased memory usage?

The WFbuffIO keyword makes it possible to reduce wavefunction I/O from disk by
keeping more wavefunctions in memory and fetching them from there (through MPI communication if required, I think).
So, some extra memory usage for sure... I am not sure how much more (this option is still under testing).

> 3. What is the intended use case for WFbuffIO?

This keyword was first implemented to port and run yambo on a BlueGene/Q machine, which, as far as I understand,
has relatively slow I/O to disk but a relatively fast interconnect between nodes. Unfortunately, memory is also an issue on that machine.

take care
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Mon Jul 06, 2015 12:59 pm

Dear Andrea,

thank you for your reply, very informative.

p2y doesn't have a -S option (I tried, and it says invalid option).
The 'standard' db fragmentation in my example is one fragment per k-point. 'Standard' means without specifying anything.
I would run a system with 10 fragments on 12 MPI tasks, for example. Sometimes it works, sometimes it doesn't. It also depends on the parameters and the position of the moon. ;-) A student mentioned that a similar 24-k-point example would run on 8 MPI tasks but crash with 12...

Anyway, I guess I have to check the file system.

Have a good day.

Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Mon Jul 06, 2015 1:08 pm

Dear Robert,
> p2y doesn't have a -S option (I tried, and it says invalid option).

In 4.x it is not a valid option anymore; databases are fragmented by default. Actually, we are implementing some input-level control over fragmentation, since it can happen that the code produces too many files or, on the contrary, a few files with too many variables for netCDF. We are still testing, and we hope to distribute this feature soon.

> The 'standard' db fragmentation in my example is one fragment per k-point. 'Standard' means without specifying anything.

Yes, for the screening it should be one fragment for each q-point.

> I would run a system with 10 fragments on 12 MPI tasks, for example. Sometimes it works, sometimes it doesn't. It also depends on the parameters and the position of the moon. ;-) A student mentioned that a similar 24-k-point example would run on 8 MPI tasks but crash with 12...

OK, it can happen, the moon's position always plays a role!!! Anyway, if you want, you can post input/report/standard output so we can have a look at the different situations and understand the problem.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Mon Jul 06, 2015 2:10 pm

Dear Daniele,

I attached the files. I hope tar files are OK.
The first time I ran the yambo G0W0, it got stuck in [05] External QP corrections (X).
I cancelled the calculation and removed the LOG and output files and folders (the g0w0 folder and the r- report file).

Funnily enough, the [05] External QP corrections (X) and [06] External QP corrections (G) steps are then missing, and PPA becomes step [05] instead of [07].

Could that be related to my problem?

When using WFbuffIO, the [05] External QP corrections (X) and [06] External QP corrections (G) steps are not present at all (starting from a clean yambo folder). Hm, it has now crashed in the same way; the attached calculation crashed... weird.

Best Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Tue Jul 07, 2015 1:13 pm

Dear Robert,
most probably the problem arises from the parallelization strategy:
X_all_q_CPU and SE_CPU should contain powers of two.
Depending on your computer architecture, try to use 8 MPI tasks (if you have 12-core nodes) or 16, set the variables accordingly, and see if this fixes the problem.
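
For example, with 8 MPI tasks a setting along these lines (a sketch: the split over the roles is just one possibility, to be adapted to your system):

    X_all_q_CPU=   "2 2 2 1"    # response function: product 2*2*2*1 = 8 MPI tasks
    X_all_q_ROLEs= "q k c v"
    SE_CPU=        "2 1 4"      # self-energy: product 2*1*4 = 8 MPI tasks
    SE_ROLEs=      "q qp b"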

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Tue Jul 07, 2015 1:34 pm

Dear Daniele,

if I have a 12-core node and run a power-of-2 set-up, like you suggest, isn't that going to be inefficient?

For example, 2*4*1*1 = 8 would not use all the cores, and 4*4*1*1 = 16 would require more cores than I have. Or am I getting something wrong?
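
As I understand it, the constraint is that the product of the entries in each CPU string must equal the number of MPI tasks; a sketch of what I mean (launch line hypothetical):

    # 8 MPI tasks on a 12-core node: 4 cores sit idle
    mpirun -np 8 yambo -F yambo.in
    # where yambo.in contains  X_all_q_CPU= "2 2 2 1"   ->  2*2*2*1 = 8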

Best Robert

PS: I will still try it, of course.
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Tue Jul 07, 2015 1:39 pm

Dear Robert,
> ... isn't that going to be inefficient?

Well, it depends: of course you are not using all the CPUs, but you have more RAM per CPU; as usual, it depends on your needs.
Anyway, at the moment we are stuck with working in powers of 2 (the parallelization strategy was designed to work on BG architectures), so better not totally efficient than nothing.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

robwarm
Posts: 6
Joined: Mon Jun 29, 2015 11:49 am
Location: Johannesburg

Re: WFbuffIO

Post by robwarm » Tue Jul 07, 2015 1:46 pm

I tried 8 MPI tasks with X_all_q_CPU= "2 2 2 1" and SE_CPU= "2 1 4". Same error.

Best Robert
Dr Robert Warmbier
Materials for Energy Research Group
School of Physics
University of the Witwatersrand, Johannesburg

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: WFbuffIO

Post by Daniele Varsano » Tue Jul 07, 2015 1:56 pm

Dear Robert,
"same error" on its own is not much to go on:
please post input/report/launch script/error.

What happens if you do not specify the parallel variables, i.e. if the code uses the defaults?
Of course, before rerunning yambo, delete the previously generated databases: ndb.pp_fragment_*
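
A sketch, assuming the fragments sit in the ./SAVE folder of the run directory (with a -J job identifier they would be in that job folder instead), and removing the ndb.pp header too so the screening is fully recomputed:

    # delete the previous screening database and its fragments
    rm SAVE/ndb.pp SAVE/ndb.pp_fragment_*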

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
