Dear Daniele,
Thanks for your prompt reply and kind support!
I reran a GW calculation and the report file is attached. This time I removed the comment I had added to the ypp source code, and that report file is attached as well.
As for the GPU memory issue, I find that before the polarizability calculation starts, the wavefunctions are copied to the device, as shown in the log file:
Code:
<24s> P1-gpu001.sulis.hpc: [MEMORY] Alloc WF%c( 38.43904 [Gb]) (HOST) TOTAL: 40.01232 [Gb] (traced) 2.050536 [Gb] (memstat)
<24s> P1-gpu001.sulis.hpc: [MEMORY] Alloc WF%c_d( 38.43904 [Gb]) (DEV) TOTAL: 39.52785 [Gb] (traced)
So here is the problem: I'm currently using A100 40 GB cards, so an allocation of this size cannot fit on a single device. I have tried the parallel strategy you suggested on 30 GPUs, which is the upper limit of this cluster, but it still does not work for some systems.
I am wondering whether there are ways to cut memory consumption further. I have tried the single-precision version, but it does not seem very effective.
Just a thought: would it be possible to add a parameter that splits the \chi calculation into equal pieces and runs them in serial turns? For example, the local conduction-band indices could be further split into groups, and the corresponding wavefunctions copied to the device before each group is processed. This way the memory issue would be handled, and converged calculations would be feasible in more cases. A rough sketch of what I mean is below.
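To make the idea concrete, here is a minimal sketch in CUDA C (Yambo itself is Fortran, so this is only an illustration of the batching pattern, not Yambo code). The full wavefunction stays on the host, and only one group of conduction bands at a time is copied into a fixed-size device buffer before its contribution to \chi is computed. All names and sizes (N_G, N_BANDS, CHUNK, process_band_chunk) are made up for the example.

Code:
/* Hypothetical band-batching sketch: host keeps the full wavefunction,
 * device only ever holds CHUNK bands at a time. Illustrative only. */
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <stdio.h>
#include <stdlib.h>

#define N_G     (1 << 18)  /* plane-wave components per band (example size) */
#define N_BANDS 256        /* local conduction bands (example size)         */
#define CHUNK   32         /* bands resident on the device at a time        */

/* Dummy kernel standing in for the chi contribution of one band group. */
__global__ void process_band_chunk(const cuDoubleComplex *wf, int n_g, int nb)
{
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < (long)n_g * nb) {
        /* ... accumulate the response-function contribution here ... */
    }
}

int main(void)
{
    size_t band_bytes  = (size_t)N_G * sizeof(cuDoubleComplex);
    size_t chunk_bytes = band_bytes * CHUNK;

    /* Full wavefunction lives on the host only. */
    cuDoubleComplex *wf_host = (cuDoubleComplex *)malloc(band_bytes * N_BANDS);

    /* Device buffer sized for a single group of bands. */
    cuDoubleComplex *wf_dev;
    cudaMalloc((void **)&wf_dev, chunk_bytes);

    for (int b0 = 0; b0 < N_BANDS; b0 += CHUNK) {
        int nb = (b0 + CHUNK <= N_BANDS) ? CHUNK : N_BANDS - b0;

        /* Copy just this group of conduction bands to the device. */
        cudaMemcpy(wf_dev, wf_host + (size_t)b0 * N_G,
                   band_bytes * nb, cudaMemcpyHostToDevice);

        int nthreads = 256;
        long nwork   = (long)N_G * nb;
        int  nblocks = (int)((nwork + nthreads - 1) / nthreads);
        process_band_chunk<<<nblocks, nthreads>>>(wf_dev, N_G, nb);
        cudaDeviceSynchronize();
    }

    cudaFree(wf_dev);
    free(wf_host);
    printf("done\n");
    return 0;
}

With this pattern the device footprint is set by CHUNK rather than by the full number of bands, at the cost of repeated host-to-device transfers, which is the trade-off I had in mind.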
Thank you,
Mingran