
GPU-specific run-time optimization parameters.

Posted: Sun Jul 25, 2021 9:30 am
by mbaris
Dear developers,

I'm testing Yambo on a cluster with 8x A100 80 GB cards per node, linked through NVLink. May I ask if there are parameters I can set at runtime to optimize the use of the cards? I would also like to mention that the "Cheat sheet" link on the Yambo GitHub page requires permission for access.
  • Is NVLink memory pooling implemented? That is, can I pool the memory of the cards together for a total of 640 GB per node with only a few CPU allocations? Even better, can I use the new NVLink over PCIe to pool the CPU memory as well, for a total of 1640 GB per node? What is the runtime procedure to do so?
  • I have noticed that for simple test runs (e.g. the hBN example), GPU memory usage is capped at a fixed value per CPU task (about 800 MB for hBN). When I increase the number of CPU tasks, this per-task value stays constant (e.g. for 16 CPU processes I am using 16x800 MB of GPU memory). Is there a way to parallelize tasks using only the GPU, filling its memory without increasing the number of CPU tasks? The code really does not seem to like the presence of CUDA MPS, so that method appears to be blocked.
  • I have noticed a reference to the "ChiLinAlgMod" parameter on GitHub; however, I was unable to find documentation on how to use it, e.g. in the wiki. Are there other runtime parameters I can use for GPU offloading? Is there documentation or a benchmark study describing best-use scenarios for such parameters?

Re: GPU-specific run-time optimization parameters.

Posted: Mon Jul 26, 2021 12:58 pm
by andrea.ferretti
Dear mbaris,

First of all, please sign your posts.

Coming to your questions, let me begin by mentioning a few facts about yambo@GPUs.

* Yambo implements simple and direct GPU support using CUDA Fortran; all GPU allocations are handled explicitly.
* CUDA-aware MPI is not used (and is not expected to have a significant impact, since no tight MPI communication is performed).
* Standard usage is one MPI task per GPU; oversubscription is possible but not recommended, since Yambo typically runs short on memory. See the job-script sketch right after this list.
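
To make the one-MPI-task-per-GPU setup concrete, here is a minimal job-script sketch for one of your 8xA100 nodes. The scheduler options, module name, and thread count are assumptions to be adapted to your cluster, not Yambo requirements:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8    # one MPI task per GPU
#SBATCH --gpus-per-node=8      # 8x A100 per node
#SBATCH --cpus-per-task=4      # assumption: a few host threads per task
#SBATCH --time=01:00:00

module load yambo              # hypothetical module name
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# One rank per GPU; on many installations SLURM's GPU binding
# (e.g. --gpus-per-task=1) takes care of the rank-to-device mapping.
srun yambo -F yambo.in -J run_gpu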

now:

* NVLink: there is no explicit implementation in Yambo, and I am not sure how one would run Yambo sharing all the GPU memory. Surely Yambo would need to be run with a single MPI task, and then I expect one would have to leverage the environment (I would recommend asking a sysadmin).

* Turning a single small calculation into a bunch of smaller pieces that run in parallel on the same GPU may be feasible somehow, but I am not sure how relevant it is. The typical situation is that Yambo exposes calculations large enough to take clear advantage of the GPUs (hBN and other simple tests are just not representative of real-life runs). If you are really interested in pushing this idea of splitting a single Yambo run into pieces, we can discuss it further; a crude oversubscription sketch follows below.
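
As noted above, plain oversubscription is possible (though not recommended): several MPI tasks can share one device. A hypothetical sketch, assuming your launcher is mpirun and your build tolerates multiple ranks per GPU:

# two MPI ranks sharing GPU 0 (plain oversubscription, no MPS)
export CUDA_VISIBLE_DEVICES=0
mpirun -np 2 yambo -F yambo.in -J run_shared

Keep in mind that each rank keeps its own explicit GPU allocations, so the per-task memory footprint you measured simply adds up.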

* Default parameters are typically already optimal for running on GPUs, while advanced variables such as "ChiLinAlgMod" are used to bring some parts of the calculation back to the CPU (here, the linear-system solver used to obtain X from Xo). This is achieved by adding "CPU" to the variable string, e.g.
ChiLinAlgMod = "lin_system, CPU"

Here is the relevant source line, with a bit of inline documentation (the beginning of the call is truncated):
'ChiLinAlgMod', '[X] inversion/lin_sys,cpu/gpu',Chi_linalg_mode,verb_level=V_resp,case="A")
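
Note that the variable is defined with verb_level=V_resp, so it is hidden at default verbosity: it should only appear in the generated input file when the response verbosity is raised, e.g. something along the lines of (double-check the exact flag against your version)

yambo -V resp -F yambo.in

after which you can edit the string as shown above.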

hope this helps
Andrea