Dear Developers and Users,
I am currently running a long Yambo job for GW/BSE calculations on a cluster; it is expected to take approximately 5 days to complete. I am concerned about the risk of job failure due to hardware or software issues, and I am interested in implementing a checkpointing strategy to mitigate this risk.
I have read about the checkpointing solution for Yambo jobs, and I understand that it involves configuring the SLURM script and modifying the input files. However, I am also wondering if there is a checkpoint restart facility in Yambo that allows me to restart the job from the last checkpoint in case of a failure, without having to rerun the entire job from the beginning.
Can you please confirm if such a checkpoint restart facility exists in Yambo, and if so, how can I enable it for my long job? Also, are there any specific modifications that I need to make to the input file or the SLURM script to use this feature?
In my search for a solution, I came across a thread in the Yambo forum (viewtopic.php?t=320&start=30) that suggests that Yambo should be able to recognize an interrupted calculation and continue from where it left off when the same input is run again. Can you please confirm if this is accurate, and if so, would it be a reliable solution to my problem?
I appreciate any guidance or resources that you can provide to help me implement a reliable checkpointing strategy for my Yambo job.
Thank you for your time and assistance.
Best regards,
Question about checkpoint restart facility in Yambo for a long GW/BSE job
Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano
-
- Posts: 21
- Joined: Tue Apr 26, 2022 3:05 pm
- Location: Paris , France
Dr. Sabrine Ayari
Laboratoire de Physique de l'École normale supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Diderot, Sorbonne Paris Cité, Paris, France
- Daniele Varsano
- Posts: 4209
- Joined: Tue Mar 17, 2009 2:23 pm
Re: Question about checkpoint restart facility in Yambo for a long GW/BSE job
Dear Sabrine,
a possible strategy is the following:
Divide your calculation into steps:
1) Calculation of the screening
2) GW calculation
3) BSE calculation
1. In the calculation of the screening (e.g. plasmon pole) a restart is possible. The calculation is done q point by q point (so it may be better to avoid parallelizing over q). All matrix elements eps^-1(q,g,g') are written to disk (ndb.pp_fragment_iq). If the code crashes for any reason, just rerun the same input file and Yambo will restart the calculation from the last q point.
2. GW calculation: Yambo reads the previously calculated screening and starts evaluating the <n| Sigma | n> matrix elements. Unfortunately no restart is available here, so the strategy is to run small calculations, i.e. each including only a few bands and k points in the %QPkrange variable. The idea is to split the %QPkrange of interest into several smaller calculations and then merge the resulting databases using the ypp utility.
3. The BSE calculation has a restart when building the kernel (just rerun with the same input). It should also have a restart for the Haydock procedure, while there is no restart for the diagonalization.
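To illustrate the splitting in step 2, here is a minimal Python sketch that cuts a k-point range into smaller pieces and prints one %QPkrange block per piece. The helper names (split_range, qpkrange_block, split_qpkrange) and the example numbers are hypothetical, and the block layout assumes the usual "first k| last k| first band| last band|" form of the variable; please check it against an input generated by your own Yambo version before using it.

```python
def split_range(first, last, chunk):
    """Split the inclusive range [first, last] into consecutive
    inclusive sub-ranges of at most `chunk` elements each."""
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    if last < first:
        raise ValueError("last must not be smaller than first")
    return [(start, min(start + chunk - 1, last))
            for start in range(first, last + 1, chunk)]


def qpkrange_block(k_first, k_last, b_first, b_last):
    """Format one %QPkrange block (assumed layout, see the note above)."""
    return f"%QPkrange\n {k_first}| {k_last}| {b_first}| {b_last}|\n%"


def split_qpkrange(k_first, k_last, b_first, b_last, k_chunk):
    """Return one %QPkrange block per chunk of k points,
    keeping the same band range in every block."""
    return [qpkrange_block(ka, kb, b_first, b_last)
            for ka, kb in split_range(k_first, k_last, k_chunk)]


if __name__ == "__main__":
    # Hypothetical example: k points 1..10, bands 5..20, 4 k points per run.
    for block in split_qpkrange(1, 10, 5, 20, k_chunk=4):
        print(block)
        print()
```

Each printed block would go into its own copy of the GW input, so that a crash costs only one small run; the resulting quasiparticle databases are then merged with ypp as described above.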
Hope it helps,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 21
- Joined: Tue Apr 26, 2022 3:05 pm
- Location: Paris , France
Re: Question about checkpoint restart facility in Yambo for a long GW/BSE job
Dear Daniele,
Many thanks for the proposed strategy.
Best regards,
Sabrine.