
Not responding calculations on collisions calculation step

Posted: Mon Nov 18, 2024 9:20 pm
by DmitrySkachkov
I am running a nonlinear calculation with yambo_nl. The first step is the calculation of the collisions.

After some time the program stops responding. Cancelling and resubmitting the job may help, and the collisions calculation may then continue:
<33m-59s> P15: Collisions |################ | [040%] 32m-03s(E) 01h-20m(X)
<46m-05s> P15: Collisions |########################################| [100%] 44m-10s(E) 44m-10s(X)
However, after reaching 100% the program stops responding again. If I cancel and resubmit the job once more, it finishes with the "Game over" status and everything looks OK.
But when I submit the job for the second step of the nonlinear calculation, the calculation of the dynamical equation, the job crashes immediately with the error:
<10s> P53-r2x01: [WARNING] HXC collisions not found.
The file with collisions exists:
18G Nov 17 23:38 ndb.COLLISIONS_HXC
8 Nov 17 09:19 ndb.COLLISIONS_HXC-2125987841-13348.lock
420K Nov 17 22:52 ndb.COLLISIONS_HXC_header
However, a .lock file is also present, which may be causing the problem.
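
A minimal sketch of how leftover lock files could be listed before resubmitting. This is generic filesystem code in Python, not part of yambo: the function name `find_lock_files` and the `*.lock` pattern are assumptions taken from the file listing above, and it is not confirmed that a leftover lock is what triggers the "collisions not found" warning.

```python
import time
from pathlib import Path


def find_lock_files(directory, pattern="*.lock", now=None):
    """Return a list of (path, age_in_seconds) for files matching `pattern`.

    `directory` is the folder holding the databases (an assumption: pass
    whichever folder contains the ndb.* files). `now` defaults to the
    current time and can be overridden for testing.
    """
    now = time.time() if now is None else now
    results = []
    for path in sorted(Path(directory).glob(pattern)):
        # Age is measured from the last modification time of the lock file.
        age = now - path.stat().st_mtime
        results.append((path, age))
    return results
```

Calling `find_lock_files("path/to/databases")` only reports the files and how long ago they were last modified. It deliberately does not delete anything, since removing a lock while another process is still writing could corrupt the database.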

How can I improve this step?
How can I prevent the program from hanging?

I tried running the job in parallel on 1, 2, ..., 8 nodes. The HPC cluster characteristics:
nodes with 2x AMD processors, 64 cores, 2000 GB of memory per node

Re: Not responding calculations on collisions calculation step

Posted: Wed Nov 20, 2024 2:36 pm
by claudio
Dear Dmitry Skachkov,

Try deleting the collisions databases and recalculating them
using yambo_rt.
yambo_rt is compiled in single precision, so it requires less memory and disk space
and is much faster.
I can guarantee that the final result will be exactly the same.
Note, however, that to run the non-linear dynamics you always need yambo_nl.

Let me know if it works; otherwise we can look for other solutions.

best
Claudio

Re: Not responding calculations on collisions calculation step

Posted: Mon Nov 25, 2024 9:27 pm
by DmitrySkachkov
Hi Claudio,

I followed your suggestions, compiled another version of yambo in single precision, and used yambo_rt instead of yambo_nl to calculate the collisions.
yambo_rt works much better and is more stable. However, I still have the same problem: after some time the job stops responding and no longer writes to the log file.
With yambo_nl the program stopped responding at ~30% of the collisions calculation; now, with single-precision yambo_rt, it stops responding at ~60%.
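
A rough sketch of how such a stall could be detected automatically, by checking how long ago the log file was last written. This is plain Python, not a yambo feature: the function names, the log path you pass in, and the 30-minute default threshold are all assumptions, and a long silence in the log only suggests a hang rather than proving one.

```python
import time
from pathlib import Path


def seconds_since_update(log_path, now=None):
    """Return the number of seconds since `log_path` was last modified."""
    now = time.time() if now is None else now
    return now - Path(log_path).stat().st_mtime


def is_stalled(log_path, max_idle_seconds=1800, now=None):
    """Return True if the log has not been written for more than the threshold.

    The threshold is an arbitrary choice: it should be longer than the
    normal interval between progress lines for the run in question.
    """
    return seconds_since_update(log_path, now) > max_idle_seconds
```

Run periodically from the job script or a separate shell, `is_stalled(...)` gives a simple yes/no signal for when to inspect or resubmit the job. It does not explain why the job stopped.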

I am attaching my input file for the collisions calculation with yambo_rt; the k-mesh is 18x18x1. The cluster has 32 GB of memory per core, so memory should not be a problem.

Could you please recommend how to eliminate this hanging problem, possibly also by reducing the parameters of the system?

Thank you,
Dmitry