Question about checkpoint restart facility in Yambo for a long GW/BSE job
Posted: Wed Mar 08, 2023 1:54 pm
Dear Developers and Users,
I am currently running a long Yambo GW/BSE calculation on a cluster; it is expected to take approximately 5 days to complete. I am concerned about the risk of job failure due to hardware or software issues, and I would like to implement a checkpointing strategy to mitigate this risk.
I have read about the checkpointing solution for Yambo jobs, and I understand that it involves configuring the SLURM script and modifying the input files. However, I am also wondering if there is a checkpoint restart facility in Yambo that allows me to restart the job from the last checkpoint in case of a failure, without having to rerun the entire job from the beginning.
Could you please confirm whether such a checkpoint restart facility exists in Yambo and, if so, how I can enable it for my long job? Also, are there any specific modifications that I need to make to the input file or the SLURM script to use this feature?
In my search for a solution, I came across a thread in the Yambo forum (viewtopic.php?t=320&start=30) suggesting that Yambo should be able to recognize an interrupted calculation and continue from where it left off when the same input is run again. Could you please confirm whether this is accurate and, if so, whether it would be a reliable solution to my problem?
I appreciate any guidance or resources that you can provide to help me implement a reliable checkpointing strategy for my Yambo job.
Thank you for your time and assistance.
Best regards,