Low efficiency of GPU version

Concerns issues with computing quasiparticle corrections to the DFT eigenvalues - i.e., the self-energy within the GW approximation (-g n), or considering the Hartree-Fock exchange only (-x)

Dean
Posts: 98
Joined: Thu Oct 10, 2019 7:03 am

Low efficiency of GPU version

Post by Dean » Mon Apr 24, 2023 11:10 am

Dear all,
I compiled the GPU version of yambo-5.1.2 with the NVHPC SDK.
However, in my tests the GPU version of yambo is much less efficient than the CPU version.
For the same test run, the CPU version takes 1.5 minutes, whereas the GPU version takes 6.5 minutes, which is a confusing result.
The GPU I used is an RTX 3090.
Details can be found in the attachments.
Are there any tips for the variables in the input file?
Thanks in advance.
Dr. Yimin Ding
Soochow University, China.

andrea.ferretti
Posts: 206
Joined: Fri Jan 31, 2014 11:13 am

Re: Low efficiency of GPU version

Post by andrea.ferretti » Wed Apr 26, 2023 9:00 am

Dear Dean,

thanks for writing.
Concerning the GPU/CPU performance of yambo, we typically expect good acceleration when running on state-of-the-art GPUs for HPC (e.g. NVIDIA A100 or similar).
If the acceleration is lacking, there could be several reasons for this behaviour:
  * the system size is too small (too much data communication relative to computation), which is especially critical on GPUs with lower memory bandwidth
  * one runs with GPU oversubscription (more than one MPI task on a single GPU)
  * general miscompilation issues
According to your input/output files, you have no oversubscription (good!) and everything looks OK.
I notice that most of the time is spent in the GW(PPA) routine, and that you use the terminator algorithm (BG).
The terminator is not ported to GPUs in the public version (it has been developed but not yet released; it will probably come with the next major release)...

This could be one of the reasons.
Moreover, your system looks rather small...

BTW: are the CPU and GPU nodes the same in terms of HW? (I see you are running with 20 MPI tasks in one case and 2 MPI tasks * 4 threads in the other.)
If the HW is the same, using 2 MPI tasks * 10 threads in the GPU case would probably work better.
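
As a rough illustration of the task/thread suggestion above, here is a minimal Python sketch that derives a hybrid layout with one MPI task per GPU and the remaining cores used as threads. The node size (20 cores, 2 GPUs) is only inferred from the runs mentioned in this thread, and the helper name hybrid_layout, the input file gw.in and the job label gpu_run are hypothetical placeholders, so adapt them to your machine.

```python
def hybrid_layout(cores_per_node: int, gpus_per_node: int) -> tuple[int, int]:
    """Return (mpi_tasks, omp_threads) for one MPI task per GPU.

    Hypothetical helper for illustration only; it is not part of yambo.
    """
    if cores_per_node < 1 or gpus_per_node < 1:
        raise ValueError("need at least one core and one GPU")
    # One MPI task per GPU avoids GPU oversubscription.
    mpi_tasks = gpus_per_node
    # Share the CPU cores evenly among the tasks as OpenMP threads.
    omp_threads = max(1, cores_per_node // mpi_tasks)
    return mpi_tasks, omp_threads


# Assumed node: 20 cores and 2 GPUs (inferred from the thread, not confirmed).
tasks, threads = hybrid_layout(cores_per_node=20, gpus_per_node=2)

# Print the launch lines; file name and job label are placeholders.
print(f"export OMP_NUM_THREADS={threads}")
print(f"mpirun -np {tasks} yambo -F gw.in -J gpu_run")
```

With the assumed node this gives 2 MPI tasks with 10 threads each, which matches the layout suggested above. The exact launcher (mpirun, srun, ...) depends on the cluster.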

Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it

Dean
Posts: 98
Joined: Thu Oct 10, 2019 7:03 am

Re: Low efficiency of GPU version

Post by Dean » Wed Apr 26, 2023 11:37 am

Dear andrea.ferretti,
Thanks for your reply.
I found a 4x speed-up without the terminator, so it is the main factor. I hope the GPU port of the terminator will be included in the public version.
Best,
Dr. Yimin Ding
Soochow University, China.
