How to restart in BSE Kernel loop?
-
- Posts: 24
- Joined: Fri Dec 15, 2017 4:17 am
How to restart in BSE Kernel loop?
Dear all,
Due to the walltime limit I must split a large BSE calculation into several parts. The screening has been done, and the BS matrix build step (Kernel loop) will take a long time. However, when I tried to restart the calculation, it either restarted from the beginning (as if no ndb.BS_Q1_CPU_* files were present) or it hung. Are any specific settings required to guarantee the restart? All I did was rerun with the same input file after the job was killed by the job management system.
I found that this question has been asked before (viewtopic.php?f=13&t=796), but I did not understand what exactly should be done to restart correctly in the BSE kernel step. Thanks very much.
Feng Wu
Chemistry and Biochemistry department,
University of California, Santa Cruz
95064 CA, United States
- Daniele Varsano
- Posts: 4231
- Joined: Tue Mar 17, 2009 2:23 pm
Re: How to restart in BSE Kernel loop?
Dear Feng Wu,
Unfortunately, at the moment the restart of the BS kernel is problematic; we are working on it.
If you post your report and log files, we can have a look and see if we can suggest an optimal parallelisation strategy
so that the calculation finishes within a reasonable wall time.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 24
- Joined: Fri Dec 15, 2017 4:17 am
Re: How to restart in BSE Kernel loop?
Dear Daniele,
Thanks for the information.
A tarball with the log files is attached. This run uses 4 MPI processes per node on 8 nodes. Because of the memory limit, not all cores are used for MPI (instead, there are 8 OpenMP threads per process), and k-eh-t = 1-32-1.
I would really appreciate any advice you could give on this.
Best,
Feng
Feng Wu
Chemistry and Biochemistry department,
University of California, Santa Cruz
95064 CA, United States
- Daniele Varsano
- Posts: 4231
- Joined: Tue Mar 17, 2009 2:23 pm
Re: How to restart in BSE Kernel loop?
Dear Feng,
From the log you attached, it looks like the calculation finished correctly.
Anyway, as you can see from the warning message in the log file:
    <03s> P0001: [WARNING] n_eh_CPU > 1 in a system with symmetries and k-points is not efficient. Try distributing first "k" and "t"
it is more efficient to parallelize over the k-points first. The report is not attached, but it seems you have 4 k-points, right? This should also distribute the memory, and if this is the case you can use more of the CPUs on your nodes.
In any case, keeping your number of cores, a strategy such as k-eh-t = 4-1-8 should perform much better.
Just out of curiosity: your BSE matrix looks huge, and you do not have many k-points. How many conduction and valence bands are you including? Perhaps you can reduce the number of valence bands included in the BSE?
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
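A quick consistency check on the two layouts mentioned above. This is a minimal sketch, assuming that the three k-eh-t numbers multiply to the number of MPI tasks (which matches the 4 x 8 = 32 tasks described in the thread); the helper name `total_tasks` is hypothetical and not part of Yambo.

```python
from math import prod


def total_tasks(layout):
    """Return the number of MPI tasks implied by a k-eh-t layout such as "4-1-8"."""
    return prod(int(n) for n in layout.split("-"))


# 4 MPI processes per node on 8 nodes, as described in the thread.
mpi_tasks = 4 * 8

for layout in ("1-32-1", "4-1-8"):
    tasks = total_tasks(layout)
    # Both layouts should use exactly the same number of tasks.
    print(layout, tasks, tasks == mpi_tasks)
```

Both layouts use the same 32 tasks; only the way they are split among the "k", "eh" and "t" roles changes. As far as I know, this split is set in the Yambo input through the BS_CPU and BS_ROLEs variables, but check the documentation of your version.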
-
- Posts: 24
- Joined: Fri Dec 15, 2017 4:17 am
Re: How to restart in BSE Kernel loop?
Dear Daniele,
This is a pretty large system, with 1280 valence electrons, so there really are a lot of bands. I would like to find out the largest system size we can run on this cluster. Thanks for your advice; I will try it.
I have another question about the k-point parallelization: I tried it once, but found that some processes ran significantly longer than others. For example, eh-only runs 18 hours on all processors, while k-only runs 4/8/16/24 hours on different processors according to the LOG files. So the total CPU time is much lower with the k-point parallelization (18*4 > 4+8+16+24), but the wall time is actually longer (24 > 18). I am sorry that I cannot find the LOG files now. Is this behaviour expected?
Best,
Feng
Feng Wu
Chemistry and Biochemistry department,
University of California, Santa Cruz
95064 CA, United States
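The arithmetic in the question above can be checked with a short script. This is only a sketch using the approximate hours quoted in the post (4 processes per case, as implied by "18*4"); the function name `summarize` is made up for illustration. Wall time is set by the slowest process, while total CPU time is the sum over processes.

```python
def summarize(hours_per_process):
    """Return (wall time, total CPU time, load balance) for per-process run times."""
    wall = max(hours_per_process)   # the job ends when the slowest process ends
    cpu = sum(hours_per_process)    # total work actually done
    # Load balance: 1.0 means every process was busy for the whole wall time.
    balance = cpu / (len(hours_per_process) * wall)
    return wall, cpu, balance


eh_only = [18, 18, 18, 18]  # hours, from the post
k_only = [4, 8, 16, 24]     # hours, from the post

for name, hours in (("eh-only", eh_only), ("k-only", k_only)):
    wall, cpu, balance = summarize(hours)
    print(f"{name}: wall = {wall} h, CPU = {cpu} h, balance = {balance:.2f}")
```

With these numbers the k-only run does less total work (52 h against 72 h) but keeps the processes busy only about half of the time, which is why its wall time is longer.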
- Daniele Varsano
- Posts: 4231
- Joined: Tue Mar 17, 2009 2:23 pm
Re: How to restart in BSE Kernel loop?
Dear Feng Wu,
There may be some imbalance, but I would not expect such a large discrepancy. Maybe something went wrong there, but without the report and log files it is hard to say.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 24
- Joined: Fri Dec 15, 2017 4:17 am
Re: How to restart in BSE Kernel loop?
Dear Daniele,
Thanks. I will try to report if this can be reproduced.
Feng
Feng Wu
Chemistry and Biochemistry department,
University of California, Santa Cruz
95064 CA, United States
- Davide Sangalli
- Posts: 643
- Joined: Tue May 29, 2012 4:49 pm
- Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Re: How to restart in BSE Kernel loop?
Dear Feng Wu,
A few more comments.
The imbalance you find with the "k"-only parallelization is not so strange to me.
Yambo distributes the k-points of the IBZ, while the BSE matrix is computed in the whole BZ.
The reason is that some matrix elements can be related by symmetry.
As Daniele pointed out, and as you indeed found, the parallelization over "eh" is balanced but not efficient.
I think that a "t"-only parallelization could instead be both balanced and efficient.
The maximum number of processors you can use for that is roughly N*(N+1)/2, with N = nk*neh, where
nk = number of k-points in the full BZ
neh = number of cores used for the eh parallelization.
Finally, I see you are also using 8 threads. I am not fully sure of the effect:
the OpenMP parallelism has not been tested much with BSE.
I hope this helps.
Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/
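The rough upper bound quoted above, N*(N+1)/2 with N = nk*neh, is the number of entries in the upper triangle (diagonal included) of an N x N matrix. A minimal sketch of the estimate follows; the function name `max_t_tasks` and the example values of nk and neh are hypothetical, since the thread does not give the number of k-points in the full BZ.

```python
def max_t_tasks(nk_full_bz, n_eh):
    """Rough upper bound on tasks usable for the "t" role: N*(N+1)/2, N = nk*neh."""
    n = nk_full_bz * n_eh
    # n*(n+1) is always even, so integer division is exact.
    return n * (n + 1) // 2


# Illustrative values only; they are not taken from the thread.
for nk, neh in ((8, 1), (8, 4)):
    print(f"nk = {nk}, neh = {neh}: about {max_t_tasks(nk, neh)} tasks at most")
```

Because the bound grows quadratically with N, even a modest number of k-points in the full BZ leaves room for many tasks in the "t" role.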
-
- Posts: 24
- Joined: Fri Dec 15, 2017 4:17 am
Re: How to restart in BSE Kernel loop?
Thanks very much for the details. I have attached my LOG files for 3 different parallelization settings:
1. k=4, eh=16: time from 8h27min to 1d06h45min
2. eh=64: time from 16h50min to 18h43min
3. eh=16, t=4: time from 15h50min to 19h28min
Also, OpenMP had no effect on the BSE part in my other tests.
Feng Wu
Chemistry and Biochemistry department,
University of California, Santa Cruz
95064 CA, United States
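To compare the three settings reported above, the time ranges can be converted to minutes and expressed as a slowest-to-fastest ratio. This is a sketch that assumes each range gives the fastest and the slowest process of that run; the helper names `to_minutes` and `spread` are made up for illustration.

```python
import re


def to_minutes(stamp):
    """Convert a duration such as "1d06h45min" or "8h27min" to minutes."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)min)?", stamp)
    if not stamp or match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return (days * 24 + hours) * 60 + minutes


def spread(fastest, slowest):
    """Ratio of the slowest to the fastest time (1.0 = perfectly balanced)."""
    return to_minutes(slowest) / to_minutes(fastest)


# Time ranges as reported in the post.
runs = {
    "k=4 eh=16": ("8h27min", "1d06h45min"),
    "eh=64": ("16h50min", "18h43min"),
    "eh=16 t=4": ("15h50min", "19h28min"),
}

for name, (fastest, slowest) in runs.items():
    print(f"{name}: wall = {to_minutes(slowest)} min, "
          f"slowest/fastest = {spread(fastest, slowest):.2f}")
```

Under that reading, the run with the "k" role has by far the largest spread between processes and also the longest wall time, while the two runs without it are much closer to balanced.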
- Davide Sangalli
- Posts: 643
- Joined: Tue May 29, 2012 4:49 pm
- Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Re: How to restart in BSE Kernel loop?
Thanks for the report.
Did you try distributing over "t" only?
Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/