I'm testing out Yambo on a cluster with 8×A100 80 GB cards per node, linked through NVLink. In this regard, may I ask whether there are parameters I can set at runtime to optimize the use of the cards? Also, I would like to mention that the "Cheat sheet" link on the Yambo GitHub page requires permission to access.
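For context, this is roughly how I am launching the test runs (a minimal sketch under SLURM; the module name is a placeholder for my site's setup, and I am assuming one MPI task per GPU):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8   # one MPI task per A100
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:8

# placeholder environment setup for my site
module load yambo-gpu         # placeholder module name

# run yambo with an input file and a job directory
srun yambo -F yambo.in -J gpu_test
```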
- Is NVLink memory pooling implemented? That is, can I pool the memory of the cards for a total of 640 GB per node while keeping only a few CPU tasks? Even better, can I use the new NVLink over PCIe to pool the CPU memory as well, for a total of 1640 GB per node? If so, what is the runtime procedure? (My NVLink topology check is in the first sketch after this list.)
- I have noticed that for simple test runs (e.g. the hBN example), the GPU memory usage is capped at a fixed value per MPI task (about 800 MB for hBN). When I increase the number of MPI tasks, this per-task value stays constant (e.g. with 16 MPI processes I use 16×800 MB of GPU memory). Is there a way to parallelize further on the GPU alone, filling the GPU memory without increasing the number of MPI tasks? The code really does not seem to tolerate CUDA MPS, so that route appears blocked (the MPS setup I tried is in the second sketch after this list).
- I have noticed a reference to the "ChiLinAlgMod" parameter on GitHub; however, I was unable to find any documentation on how to use it, e.g. in the wiki. Are there other runtime parameters I can use to control GPU offloading? Is there documentation or a benchmark study covering best-use scenarios for such parameters? (The input-level parallelism variables I am currently setting are in the third sketch after this list.)
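Regarding the first point, this is how I verify that the cards on a node are actually meshed over NVLink, and how I watch the per-task memory cap while a job runs (standard nvidia-smi queries; this says nothing about how Yambo itself uses the links):

```bash
# print the GPU interconnect topology matrix; NV# entries indicate
# NVLink connections between GPU pairs, PIX/PXB/SYS indicate PCIe paths
nvidia-smi topo -m

# poll per-GPU memory usage every 5 s to watch the per-task cap
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5
```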
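Regarding the second point, this is the MPS setup I tried before concluding that the code does not like it (standard MPS control commands; the directories are placeholders):

```bash
# start the CUDA MPS control daemon (directories are placeholders)
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
nvidia-cuda-mps-control -d

# ... run yambo with more MPI tasks than GPUs ...

# shut the daemon down afterwards
echo quit | nvidia-cuda-mps-control
```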
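Regarding the last point, these are the input-level parallelism variables I am already aware of; the names follow the examples on the parallelism wiki page and may differ between Yambo versions, and I do not know whether "ChiLinAlgMod" belongs alongside them, hence the question. A sketch for 8 MPI tasks:

```
# fragment of yambo.in: parallel structure for a GW-type run on 8 tasks
DIP_CPU= "1 4 2"          # [PARALLEL] CPUs for each role
DIP_ROLEs= "k c v"        # [PARALLEL] CPU roles (k, c, v)
X_CPU= "1 1 1 4 2"        # [PARALLEL] CPUs for each role
X_ROLEs= "q g k c v"      # [PARALLEL] CPU roles (q, g, k, c, v)
SE_CPU= "1 2 4"           # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b"        # [PARALLEL] CPU roles (q, qp, b)
X_nCPU_LinAlg_INV= 8      # [PARALLEL] CPUs for response linear algebra
```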