Weird behaviour with NgsBlkXp

Posted: Fri Nov 17, 2023 10:38 am
by dagosta
Dear All,

I am running a Yambo calculation on a small gold cluster (4 atoms, 76 electrons). I am doing some convergence tests, in particular on the NgsBlkXp variable for the GW correction in the PPA. Up to NgsBlkXp= 9 Ry the runs complete quickly (a few minutes at most). If I select 10 Ry, the code starts the calculation but then somehow sits there forever. I let the cluster run this calculation for about 10 hours (compared with the roughly 10 minutes needed for 9 Ry) before killing it. No sign of a crash or any other error appeared in the report or log files, and the cluster did not signal any issue either (memory, or crashing processes). Where can I look for the possible problem? How can I proceed?

A few pieces of information:
1) Running on Yambo 5.2.0
2) About 8 nodes, 384 processes (both MPI and OpenMP)

I attach here the log of CPU 1 and the report for the 9 Ry case. The 10 Ry case looks similar, but it sits at X@q[1] forever.

All the best,
Roberto

Re: Weird behaviour with NgsBlkXp

Posted: Fri Nov 17, 2023 11:32 am
by Daniele Varsano
Ciao Roberto,

I can see from the report at 9 Ry that you have an X matrix which is more than 20k x 20k.

X matrix size                                    :  20875
At 10 Ry it will be larger, and the inversion of such a matrix scales very badly with its dimension.
You can try to use more CPUs and increase the number used in the ScaLAPACK procedure:
X_and_IO_nCPU_LinAlg_INV= 16
but I'm not sure that this will solve the problem, as ScaLAPACK is not super efficient.
We have faced the same issue in the past, and people who have analyzed the problem in depth can give you more insight.

Best,
Daniele

Re: Weird behaviour with NgsBlkXp

Posted: Fri Nov 17, 2023 12:07 pm
by andrea.ferretti
Hi Roberto,

just to follow up on Daniele's reply,

since you are at Gamma with about 384 cores, all processors are available during the inversion of the Dyson equation for P/Chi/W.
This means that if the strategy with ScaLAPACK works, you have a lot of room for improvement.

you can even use something like:
X_and_IO_nCPU_LinAlg_INV= 100
that is a 10x10 scalapack grid
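
To get a feeling for what the grid size means for memory, here is a minimal sketch, not taken from the Yambo source. It assumes that the X matrix is stored in single-precision complex (8 bytes per element, which depends on how Yambo was compiled), that the grid is square, and that the matrix is spread evenly over it. The function names are hypothetical.

```python
import math

def square_grid(n_procs):
    """Return the side p of a p x p grid if n_procs is a perfect square, else None."""
    p = math.isqrt(n_procs)
    return p if p * p == n_procs else None

def local_block_gib(n, n_procs, bytes_per_element=8):
    """Approximate memory (GiB) held by each process for one n x n matrix.

    Assumption: even distribution over a square grid, single-precision
    complex elements (8 bytes). Work arrays are not included.
    """
    p = square_grid(n_procs)
    if p is None:
        raise ValueError(f"{n_procs} is not a perfect square")
    rows = math.ceil(n / p)  # rows (and columns) per process
    return rows * rows * bytes_per_element / 2**30

n = 24357  # X matrix size reported for NgsBlkXp = 10 Ry
for n_procs in (1, 16, 100):
    p = square_grid(n_procs)
    print(f"{n_procs:4d} processes ({p}x{p} grid): "
          f"{local_block_gib(n, n_procs):.3f} GiB per process")
```

Under these assumptions the memory per process for the matrix itself drops roughly as 1/(number of processes in the grid), so a larger grid should lower the per-process footprint of this step rather than raise it.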

cheers
Andrea

Re: Weird behaviour with NgsBlkXp

Posted: Fri Nov 17, 2023 2:38 pm
by dagosta
Dear Daniele and Andrea,

Thanks for your fast and informative answers. I have checked, and from 9 Ry to 10 Ry the X matrix size increases from 20875 to 24357 (a factor of about 1.2), which even in the worst-case scenario for matrix inversion should at most double the computational time.
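
The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch assuming the cost of a dense inversion grows as N^3 and the storage as N^2; the variable names are illustrative, and the two sizes are the ones quoted from the reports.

```python
# Rough scaling estimate for a dense matrix inversion.
# Assumptions: time ~ N^3, memory ~ N^2 (ideal dense linear algebra).
n_9ry = 20875   # X matrix size at NgsBlkXp = 9 Ry
n_10ry = 24357  # X matrix size at NgsBlkXp = 10 Ry

size_ratio = n_10ry / n_9ry
time_ratio = size_ratio ** 3
memory_ratio = size_ratio ** 2

print(f"size ratio:   {size_ratio:.2f}")
print(f"time ratio:   {time_ratio:.2f}")
print(f"memory ratio: {memory_ratio:.2f}")
```

With ideal cubic scaling the time ratio stays below 2, so going from about 10 minutes to more than 10 hours cannot be explained by the matrix size alone.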

I will try your recommended strategy. However, I am still quite limited in memory, so I am not sure I can assign that many cores to ScaLAPACK.

What I have also noticed, looking at the output of ‘top’, is that apparently only the MPI processes are still active while the OpenMP threads are not working (I see a user CPU usage of 100% rather than the 1200% of the previous calculations).

Regards,
Roberto