After some time the program stops responding. Cancelling and resubmitting the job may help, and the collisions calculation may then continue:
<33m-59s> P15: Collisions |################ | [040%] 32m-03s(E) 01h-20m(X)
<46m-05s> P15: Collisions |########################################| [100%] 44m-10s(E) 44m-10s(X)
However, after reaching 100% the program stops responding again. If I cancel and resubmit the job once more, it finishes with the "Game over" status and everything looks OK.
But when I submit the job for the second step, the nonlinear calculation that solves the dynamical equation, it crashes immediately with the error:
<10s> P53-r2x01: [WARNING] HXC collisions not found.
The file with the collisions exists:
18G Nov 17 23:38 ndb.COLLISIONS_HXC
8 Nov 17 09:19 ndb.COLLISIONS_HXC-2125987841-13348.lock
420K Nov 17 22:52 ndb.COLLISIONS_HXC_header
But a .lock file is also present, which may be causing the problem.
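To check whether a lock file was left behind by a cancelled run, a small script along these lines could help. This is only a sketch: it assumes lock files sit in the same directory as the database and follow the `<database name>...lock` pattern seen in the listing above. The function name `find_stale_locks` is hypothetical and is not part of the program being run. It only reports lock files older than the database, and whether such a file is safe to delete is a separate question.

```python
from pathlib import Path


def find_stale_locks(directory, db_name="ndb.COLLISIONS_HXC"):
    """Return names of lock files for db_name that are older than the database.

    Assumption: lock files are named db_name + <suffix> + ".lock", as in the
    directory listing above. Nothing is deleted or modified.
    """
    directory = Path(directory)
    db_file = directory / db_name
    if not db_file.exists():
        # No database file, so there is nothing to compare against.
        return []

    db_mtime = db_file.stat().st_mtime
    stale = []
    for lock in sorted(directory.glob(db_name + "*.lock")):
        # A lock last touched before the database was last written
        # probably belongs to an earlier (cancelled) run.
        if lock.stat().st_mtime < db_mtime:
            stale.append(lock.name)
    return stale
```

Calling `find_stale_locks(".")` in the directory that holds the databases returns the suspicious lock files as a list. In the listing above the lock file is dated 09:19 and the database 23:38 on the same day, so it would be reported.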
How can I improve this step?
How can I prevent the program from hanging?
I tried running the job on 1, 2, ..., 8 nodes in parallel. The HPC cluster characteristics:
nodes with 2x AMD processors, 64 cores, 2000 GB of memory per node