Not responding calculations on collisions calculation step
Posted: Mon Nov 18, 2024 9:20 pm
I am running a nonlinear calculation with yambo_nl; the first step is the calculation of the collisions.
After some time the program stops responding. Cancelling and resubmitting the job may help, and the collisions calculation continues:

<33m-59s> P15: Collisions |################                        | [040%] 32m-03s(E) 01h-20m(X)
<46m-05s> P15: Collisions |########################################| [100%] 44m-10s(E) 44m-10s(X)

However, after reaching 100% the program stops responding again. If I cancel and resubmit the job once more, it finishes with the "Game over" status and everything looks OK.
But when I submit the job for the second nonlinear step, which solves the dynamical equation, it crashes immediately with the error:

<10s> P53-r2x01: [WARNING] HXC collisions not found.

The file with the collisions exists:

 18G Nov 17 23:38 ndb.COLLISIONS_HXC
   8 Nov 17 09:19 ndb.COLLISIONS_HXC-2125987841-13348.lock
420K Nov 17 22:52 ndb.COLLISIONS_HXC_header

But a .lock file also appears, which may be the cause of the problem.
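A lock file left behind by a killed job can plausibly make the second step refuse to open the collisions database. A minimal sketch of a pre-submission check, assuming the lock-file naming pattern shown in the listing above (the `mktemp` demo directory and the simulated lock file are illustration only, not part of any real run):

```shell
# Sketch: detect and clear stale yambo lock files before resubmitting.
# The 'ndb.COLLISIONS_HXC-*.lock' pattern is an assumption based on the
# listing above; adapt it to the actual files in your SAVE directory.
cd "$(mktemp -d)"                       # demo directory for illustration
touch ndb.COLLISIONS_HXC-12345-67.lock  # simulate a leftover lock file
locks=$(find . -name 'ndb.COLLISIONS_HXC-*.lock')
if [ -n "$locks" ]; then
  echo "stale lock(s) found: $locks"
  # Remove only when you are sure no yambo job is still running:
  rm -f ndb.COLLISIONS_HXC-*.lock
fi
```

Whether deleting the lock is safe here is something the developers would have to confirm; the sketch only shows how to spot the leftover file.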
How can I improve this step? How can I avoid the program becoming unresponsive?

I tried running the job on 1, 2, ..., 8 nodes in parallel. HPC cluster characteristics: nodes with 2x 64-core AMD processors and 2000 GB of memory per node.