Question about checkpoint restart facility in Yambo for a long GW/BSE job

Post by **sabrine** » Wed Mar 08, 2023 1:54 pm

Dear Developers and Users,

I am currently running a long Yambo job for GW/BSE calculations on a cluster; it is expected to take approximately 5 days to complete. I am concerned about the risk of job failure due to hardware or software issues, and I am interested in implementing a checkpointing strategy to mitigate this risk.

I have read about the checkpointing solution for Yambo jobs, and I understand that it involves configuring the SLURM script and modifying the input files. However, I am also wondering if there is a checkpoint restart facility in Yambo that allows me to restart the job from the last checkpoint in case of a failure, without having to rerun the entire job from the beginning.

Can you please confirm if such a checkpoint restart facility exists in Yambo, and if so, how can I enable it for my long job? Also, are there any specific modifications that I need to make to the input file or the SLURM script to use this feature?

In my search for a solution, I came across a thread in the Yambo forum (viewtopic.php?t=320&start=30) that suggests that Yambo should be able to recognize an interrupted calculation and continue from where it left off when the same input is run again. Can you please confirm if this is accurate, and if so, would it be a reliable solution to my problem?

I appreciate any guidance or resources that you can provide to help me implement a reliable checkpointing strategy for my Yambo job.
Thank you for your time and assistance.

Best regards,

Post by **Daniele Varsano** » Wed Mar 08, 2023 4:02 pm

Dear Sabrine,

A possible strategy is the following:

Divide your calculation into steps:

1) Calculation of the screening
2) GW calculation
3) BSE calculation

1. The calculation of the screening (e.g. plasmon pole) can be restarted. The calculation is done for each q point (maybe avoid parallelizing over q), and all matrix elements eps^-1(q,g,g') are written to disk (ndb.pp_fragment_iq). If the code crashes for any reason, just rerun the same input file and Yambo will restart the calculation from the last q point.
2. In the GW calculation, Yambo reads the previously calculated screening and starts to evaluate the <n| Sigma | n> elements. Here, unfortunately, no restart is possible, so the strategy is to run small calculations, i.e. including only a few bands and k points in the %QPkrange variable. The idea is to split the %QPkrange of interest into several smaller calculations and then merge the databases using the ypp utility.
3. The BSE calculation has a restart when building the kernel (just rerun with the same input). It should also have a restart for the Haydock procedure, while there is no restart for the diagonalization.
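
To illustrate the splitting in point 2, here is a minimal Python sketch that cuts one k-point range into smaller %QPkrange blocks, one per GW run. The helper names (`split_qpkrange`, `format_qpkrange`) are hypothetical, and the block layout (first k | last k | first band | last band) is an assumption about the usual input format, so please check it against an input file generated by your own Yambo version. Labelling each run and merging the resulting databases with ypp are left out.

```python
def split_qpkrange(k_first, k_last, b_first, b_last, k_chunk):
    """Split the k-point range [k_first, k_last] into consecutive chunks
    of at most k_chunk points, keeping the same band range in each chunk."""
    if k_chunk < 1:
        raise ValueError("k_chunk must be at least 1")
    blocks = []
    for start in range(k_first, k_last + 1, k_chunk):
        stop = min(start + k_chunk - 1, k_last)
        blocks.append((start, stop, b_first, b_last))
    return blocks


def format_qpkrange(block):
    """Render one chunk as a %QPkrange input block.
    Assumed layout: first k | last k | first band | last band |"""
    k1, k2, b1, b2 = block
    return f"%QPkrange\n {k1}| {k2}| {b1}| {b2}|\n%"


if __name__ == "__main__":
    # Example: k points 1-10, bands 5-12, at most 4 k points per run.
    for i, block in enumerate(split_qpkrange(1, 10, 5, 12, k_chunk=4), start=1):
        print(f"# run {i}")
        print(format_qpkrange(block))
```

Each printed block would go into its own copy of the GW input, so a crash only costs the chunk that was running at the time.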

Hope it helps,

Daniele

Post by **sabrine** » Fri Mar 10, 2023 12:02 pm

Dear Daniele,

Many thanks for the proposed strategy.

Best regards,
Sabrine

Yambo Community Forum
