How to restart in BSE Kernel loop?

Posted: Wed Jan 03, 2018 1:51 am
by wufeng
Dear all,
Due to the walltime limit, I must split a large BSE calculation into several parts. The screening has been done, and the BS matrix build step (Kernel loop) will take a long time. However, when I tried to restart the calculation, it either restarted from the beginning (as if no ndb.BS_Q1_CPU_* files were present) or hung. Are any specific settings required to guarantee the restart? All I did was rerun with the same input file after the job was killed by the job management system.

I found that this question has been asked before (viewtopic.php?f=13&t=796), but I did not understand what exactly should be done to restart correctly in the BSE kernel step. Thanks very much.

Re: How to restart in BSE Kernel loop?

Posted: Wed Jan 03, 2018 10:36 am
by Daniele Varsano
Dear Feng Wu,
unfortunately, at the moment the restart of the BS kernel is problematic; we are working on it.
If you post your report and log files, we can have a look and try to suggest an optimal parallelisation strategy so that the calculation finishes in a reasonable wall time.

Best,

Daniele

Re: How to restart in BSE Kernel loop?

Posted: Thu Jan 04, 2018 1:27 am
by wufeng
Dear Daniele,
Thanks for the information.

A log file tarball is attached. This is a case with 4 processors per node * 8 nodes. Due to the memory limit, not all cores are used (instead, 8 OpenMP threads per CPU), and k-eh-t = 1-32-1.
I would really appreciate any advice you could provide about this.


Best,
Feng

Re: How to restart in BSE Kernel loop?

Posted: Thu Jan 04, 2018 3:32 pm
by Daniele Varsano
Dear Feng,
from the log you attached, it looks like the calculation finished correctly.
Anyway, as you can see from the warning message in the log file:

<03s> P0001: [WARNING] n_eh_CPU > 1 in a system with symmetries and k-points is not efficient. Try distributing first "k" and "t"
it is more efficient to parallelize over the k-points first. The report is not attached, but it seems you have 4 k-points, right? This should also distribute the memory, and if this is the case you can use more CPUs on your nodes.
In any case, keeping the same number of cores, a strategy such as k-eh-t = 4-1-8 should perform much better.
Just out of curiosity: your BSE matrix looks huge and you do not have many k-points, so how many conduction and valence bands are you including? Perhaps you can reduce the number of valence bands included in the BSE?

Best,

Daniele
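The suggested layout can be sanity-checked before submitting the job. The following is a minimal sketch, not part of Yambo: the helper `bse_parallel_lines` is hypothetical, and it assumes that the BSE parallel layout is set through the `BS_CPU` and `BS_ROLEs` input variables and that the product of the three roles should equal the number of MPI tasks (32 here, from 4 processors per node * 8 nodes). Check the variable names against the input file generated by your own Yambo version.

```python
from math import prod

def bse_parallel_lines(n_mpi_tasks, k, eh, t):
    """Return input lines for a k-eh-t layout (hypothetical helper).

    Assumption: the product k*eh*t should match the number of MPI tasks.
    """
    layout = (k, eh, t)
    if any(n < 1 for n in layout):
        raise ValueError("each role needs at least one task")
    if prod(layout) != n_mpi_tasks:
        raise ValueError(
            f"{k}*{eh}*{t} = {prod(layout)} does not match {n_mpi_tasks} MPI tasks"
        )
    # Assumed variable names: BS_CPU (tasks per role) and BS_ROLEs (role order).
    return [f'BS_CPU= "{k} {eh} {t}"', 'BS_ROLEs= "k eh t"']

# 4 processors per node * 8 nodes = 32 MPI tasks, layout k-eh-t = 4-1-8.
lines = bse_parallel_lines(32, k=4, eh=1, t=8)
print("\n".join(lines))
```

The original layout 1-32-1 passes the same check; the point of the helper is only to catch layouts whose product does not match the job size.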

Re: How to restart in BSE Kernel loop?

Posted: Thu Jan 04, 2018 7:34 pm
by wufeng
Dear Daniele,
This is a pretty large system, with 1280 valence electrons, so there are really a lot of bands. I would like to find out the limit of the system size we can run on this cluster. Thanks for your advice; I will try it.

I have another question about the k-point parallelization: I once tried it and found that some processes ran significantly longer than others. For example, eh-only ran 18 hours on all processors, while k-only ran 4/8/16/24 hours on different processors according to the LOG files. So the total time is much less with k-point parallelization (18*4 > 4+8+16+24), but the wall time is actually longer (24 > 18). I am sorry I cannot find the LOG files now. Is this behaviour expected?

Best,
Feng
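The arithmetic in the post above can be written out as a short sketch. It uses only the per-process hours quoted in the post (18 hours on each of four processes for eh-only, 4/8/16/24 hours for k-only); the function name `summarize` and the imbalance measure (slowest process divided by the mean) are illustrative choices, not something reported by Yambo.

```python
def summarize(hours_per_process):
    """Total CPU time, wall time and load imbalance for one run."""
    total = sum(hours_per_process)       # summed over all processes
    wall = max(hours_per_process)        # the job ends with the slowest process
    mean = total / len(hours_per_process)
    return {"total_cpu_h": total, "wall_h": wall, "imbalance": wall / mean}

eh_only = summarize([18, 18, 18, 18])
k_only = summarize([4, 8, 16, 24])

print("eh-only:", eh_only)
print("k-only: ", k_only)
```

The k-only run does less total work (52 versus 72 process-hours) but waits for its slowest process, so its wall time is longer (24 versus 18 hours).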

Re: How to restart in BSE Kernel loop?

Posted: Fri Jan 05, 2018 9:39 am
by Daniele Varsano
Dear Feng Wu,
There may be some imbalance, but I would not expect such a large discrepancy. Maybe something went wrong there, but without the report and logs it is hard to say.

Best,
Daniele

Re: How to restart in BSE Kernel loop?

Posted: Sat Jan 06, 2018 12:59 am
by wufeng
Dear Daniele,
Thanks. I will report back if this can be reproduced.


Feng

Re: How to restart in BSE Kernel loop?

Posted: Sat Jan 06, 2018 11:30 pm
by Davide Sangalli
Dear Feng Wu,
a few more comments.

The imbalance you find in the "k"-only parallelization is not so strange to me.
Yambo distributes the points in the IBZ while the BSE matrix is computed in the whole BZ.
The reason is that some matrix elements can be symmetry related.

As Daniele pointed out, and you indeed found out, the parallelization over "eh" is balanced but not efficient.

I think instead the parallelization over "t"-only could be both balanced and efficient.
The maximum number of processors you can use for that is roughly N*(N+1)/2, with N=nk*neh, where
nk = number of k-points in the full BZ
neh = number of cores used for the eh parallelization.

Finally, I see you are also using 8 threads. I am not fully sure of the effect.
The OpenMP parallelism has not been tested much with BSE.

Hope it helps.
Best,
D.
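The estimate N*(N+1)/2 from the post above is easy to evaluate for a given setup. This is a minimal sketch: the function name `max_t_cpus` is hypothetical, and the example values of nk and neh are placeholders, since the number of k-points in the full BZ for this system is not given in the thread. The expression equals the number of entries in the upper triangle (diagonal included) of an N x N array, which is presumably where the bound comes from.

```python
def max_t_cpus(nk_full_bz, neh):
    """Rough upper bound on processors usable for the "t" role.

    nk_full_bz: number of k-points in the full BZ
    neh:        number of cores used for the eh parallelization
    """
    if nk_full_bz < 1 or neh < 1:
        raise ValueError("nk_full_bz and neh must be positive")
    n = nk_full_bz * neh
    return n * (n + 1) // 2

# Placeholder values, not taken from the thread.
example = max_t_cpus(nk_full_bz=8, neh=1)
print(example)
```

Because the bound grows quadratically with nk*neh, even a modest number of k-points in the full BZ leaves room for many "t" processes.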

Re: How to restart in BSE Kernel loop?

Posted: Fri Jan 12, 2018 12:29 am
by wufeng
Thanks very much for the details. I have attached my LOG file with 3 different parallelization settings.
  1. k=4 eh=16, time from 8h27min to 1d06h45min
  2. eh=64, time from 16h50min to 18h43min
  3. eh=16 t=4, time from 15h50min to 19h28min

Also, OpenMP had no effect on the BSE part in my other test.
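To compare the three settings above on an equal footing, the quoted time ranges can be converted to minutes and expressed as a ratio of the longest to the shortest time. This is a small sketch under the assumption that each range gives the fastest and slowest process of that run; the parser `to_minutes` is a hypothetical helper written for the time format used in the post.

```python
import re

def to_minutes(text):
    """Parse strings such as '1d06h45min' or '8h27min' into minutes."""
    m = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)min)?", text)
    if m is None or not any(m.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in m.groups())
    return (days * 24 + hours) * 60 + minutes

# Time ranges as quoted in the post (shortest, longest).
runs = {
    "k=4 eh=16": ("8h27min", "1d06h45min"),
    "eh=64": ("16h50min", "18h43min"),
    "eh=16 t=4": ("15h50min", "19h28min"),
}

# Ratio of longest to shortest time: 1.0 would be a perfectly balanced run.
spread = {name: to_minutes(hi) / to_minutes(lo) for name, (lo, hi) in runs.items()}

for name, ratio in spread.items():
    print(f"{name}: longest/shortest = {ratio:.2f}")
```

With these numbers the k=4 eh=16 run has by far the widest spread, while the eh=64 and eh=16 t=4 runs stay much closer to balanced.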

Re: How to restart in BSE Kernel loop?

Posted: Fri Jan 12, 2018 12:00 pm
by Davide Sangalli
Thanks for the report.
Did you try distributing over "t" only?

Best,
D.