Dear all,
I compiled the GPU version of yambo-5.1.2 with the NVHPC SDK.
However, based on my tests, the GPU version of yambo is much less efficient than the CPU version.
For the same test run, the CPU version of yambo takes 1.5 minutes, while the GPU version takes 6.5 minutes. This result is confusing.
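To put the two timings above on a common scale, here is a minimal Python sketch that computes the slowdown ratio. The function name `slowdown` is hypothetical and used only for illustration; the two numbers are the wall times quoted above.

```python
def slowdown(gpu_minutes: float, cpu_minutes: float) -> float:
    """Return how many times longer the GPU run took than the CPU run."""
    if cpu_minutes <= 0:
        raise ValueError("cpu_minutes must be positive")
    return gpu_minutes / cpu_minutes


# Wall times quoted in this post (minutes).
ratio = slowdown(gpu_minutes=6.5, cpu_minutes=1.5)
print(f"GPU run is {ratio:.1f}x slower than the CPU run")  # prints 4.3x
```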
The GPU device I used is an RTX 3090.
Details can be found in the attachments.
Are there any tips for the variables in the input file?
Thanks in advance.
Low efficiency of GPU version
-
- Posts: 109
- Joined: Thu Oct 10, 2019 7:03 am
Dr. Yimin Ding
Soochow University, China.
-
- Posts: 214
- Joined: Fri Jan 31, 2014 11:13 am
Re: Low efficiency of GPU version
Dear Dean,
thanks for writing.
Concerning the GPU/CPU performance of yambo, we typically expect good acceleration when running on state-of-the-art HPC GPUs (e.g. NVIDIA A100 or similar).
If the acceleration is lacking, there could be several reasons:
* the system size is too small (too much data communication relative to computation), which is especially critical on GPUs with lower memory bandwidth
* one runs with GPU oversubscription (more than one MPI task on a single GPU)
* general miscompilation issues
I notice that most of the time spent is in the GW(PPA) routine, and that you use the terminator algorithm (BG).
Terminator is not ported on GPUs in the public version (it has been developed but not released yet, probably coming with the next major release)...
This could be one of the reasons.
Moreover, your system looks rather small...
BTW: are the CPU and GPU nodes the same in terms of hardware? (I see you are running 20 MPI tasks in one case and 2 MPI tasks * 4 threads in the other.)
If the hardware is the same, using 2 MPI tasks + 10 threads in the GPU case would probably work better.
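As a rough illustration of the layout suggested above, here is a minimal Python sketch that assigns one MPI task per GPU and splits the remaining cores into threads. It assumes a node with 20 cores and 2 GPUs, which is only inferred from the task counts mentioned in this thread, not confirmed. The helper names `suggest_layout` and `is_oversubscribed` are hypothetical and are not part of yambo.

```python
from typing import Tuple


def suggest_layout(total_cores: int, n_gpus: int) -> Tuple[int, int]:
    """Return (mpi_tasks, threads_per_task) using one MPI task per GPU."""
    if total_cores < 1 or n_gpus < 1:
        raise ValueError("total_cores and n_gpus must be positive")
    mpi_tasks = n_gpus  # one task per GPU avoids oversubscription
    threads_per_task = max(1, total_cores // mpi_tasks)
    return mpi_tasks, threads_per_task


def is_oversubscribed(mpi_tasks: int, n_gpus: int) -> bool:
    """True when more than one MPI task would share a GPU."""
    return mpi_tasks > n_gpus


# Assumed hardware: 20 cores and 2 GPUs on the node.
print(suggest_layout(total_cores=20, n_gpus=2))    # (2, 10)
print(is_oversubscribed(mpi_tasks=20, n_gpus=2))   # True
```

The thread count would then typically be passed to the run through `OMP_NUM_THREADS`, with the MPI task count set by the launcher.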
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it
-
- Posts: 109
- Joined: Thu Oct 10, 2019 7:03 am
Re: Low efficiency of GPU version
Dear andrea.ferretti,
Thanks for your reply.
I found a 4x speed-up without the terminator, so it is the main factor. I hope the terminator will be ported to GPUs in the public version.
Best,
Dr. Yimin Ding
Soochow University, China.