
BSE SLEPC - memory efficiency could be improved

Posted: Fri Mar 28, 2025 10:17 am
by malwi
Dear Yambo Developers,

The BSE SLEPc solver reads the large ndb.BS_PAR_Q1 database using all CPUs and takes a lot of memory (a 6-atom cell with SOC needs 400 GB).
Reading takes about 1h-07m for the case I tried, while the solver itself takes about 3 min.
Would it be possible to rewrite this part so that the calculations are done "on-the-fly"?

Below is the time report.
Best regards,
Gosia

Code: Select all

 [06] Timing Overview
 ====================

 Clock: global (MAX - min (if any spread is present) clocks)
            (...)
            io_WF :     13.1474s P30 (   607 calls,   0.022 msec avg) [MAX]      0.0208s P3 [min]
DIPOLE_transverse :     40.5606s P37 [MAX]      0.0319s P31 [min]
            (...)
          Dipoles :     42.3740s P2 [MAX]     42.3739s P42 [min]
     Slepc Solver :    208.9028s P1 [MAX]    208.8697s P41 [min]
            io_BS :      01h-07m P40 (109384 calls,   0.037 sec avg) [MAX]       52m-28s P4 (109384 calls,   0.029 sec avg) [min]

Re: BSE SLEPC - memory efficiency could be improved

Posted: Fri Mar 28, 2025 10:23 am
by Daniele Varsano
Dear Gosia,

Can you please also post the input and report files?
Some information, such as the size of the BSE matrix and the percentage of eigenvectors required, would be useful.

Best,

Daniele

Re: BSE SLEPC - memory efficiency could be improved

Posted: Fri Mar 28, 2025 11:56 am
by andrea.ferretti
Dear Gosia,

besides Daniele's suggestion (which should be considered first),
once the input file is OK you could consider checking this development branch:

Code: Select all

https://github.com/yambo-code/yambo/tree/tech-ydiago

It is currently in a pull request and contains a fix contributed by a user that improves the BSE solver
(memory footprint and solver setup).
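
For reference, fetching and checking out that branch should look roughly like this (just a sketch; the branch name is taken from the URL above):

Code: Select all

git clone https://github.com/yambo-code/yambo.git
cd yambo
git checkout tech-ydiago

After that, the code can be configured and compiled as usual.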

I am not sure this helps, but several fixes are included.

take care
Andrea

Re: BSE SLEPC - memory efficiency could be improved

Posted: Fri Mar 28, 2025 1:01 pm
by Davide Sangalli
Dear Gosia,
the issue is that the code is making too many calls to io_BS_PAR, possibly because the BSE kernel was split into many blocks.

You can switch off the I/O of the BSE kernel by adding this line to the input:

Code: Select all

DBsIOoff="BS"

Best,
D.

Re: BSE SLEPC - memory efficiency could be improved

Posted: Sat Apr 19, 2025 4:21 pm
by malwi
Dear Daniele, Andrea and Davide,
Thank you very much for your prompt answers, and forgive our late reply. We have set up a test suite at Cyfronet and on LUMI to check the memory use in Yambo. I prepared cells with 6, 12, 48 and 96 atoms and three levels of accuracy with respect to k-points, bands and plane waves.
The long job was run by a colleague. My job with the same input on the Ares computer was much shorter: we tested it several times and the running time went from 1h-07m down to 29 min, all using 48 CPUs on 1 node,
while the running time with 96 CPUs on 2 nodes was only 6 min. It seems to depend strongly on the number of users on the machine.
I am continuing the memory tests (up to 96 atoms). The inputs for the above case are attached.

Happy Easter!
Gosia

Re: BSE SLEPC - memory efficiency could be improved

Posted: Fri Apr 25, 2025 7:30 am
by batman
Dear Gosia,

A couple of things:

1) When dealing with large files, parallel I/O on HPC machines requires some manual tuning, and this can have a significant impact on the I/O time.
This is not related to Yambo or its underlying I/O libraries; rather, it depends on the specific filesystem your HPC uses.
For instance, according to https://docs.lumi-supercomputer.eu/stor ... ems/lumip/, the scratch storage is equipped with a Lustre filesystem.
In that case you need to adjust some parameters (mainly the stripe count for Lustre, in my experience); see the striping sketch after point 2 below, and please refer to your HPC documentation.
For LUMI, a quick Google search led me to this: https://lumi-supercomputer.github.io/LU ... 08_Lustre/
Alternatively, you can contact your HPC administrators for help.
In any case, since the kernel calculation is very fast in your case, if you do not need to store it you can simply follow Davide's suggestion.

2) If you are already aware of (1), read on. As the timing report shows, the io_BS routine in Yambo is called over 100K times.
This is exactly what the HDF5/NetCDF libraries advise users to avoid when performing large parallel writes.
However, this is a deliberate compromise: writing frequently allows BSE calculations to be restarted, at the cost of some performance penalty, and that penalty can become significant when writing large files. This is not a solution, just an explanation of why certain things are sometimes coded this way.
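
In case it is useful, here is a minimal sketch of what adjusting Lustre striping can look like (lfs getstripe/setstripe are standard Lustre commands; the directory path and stripe count below are only placeholders, so please check your HPC documentation for recommended values):

Code: Select all

# show the current striping of the directory where the large Yambo databases are written (placeholder path)
lfs getstripe ./SAVE

# stripe files created in this directory across 8 OSTs ("-c -1" would use all available OSTs)
lfs setstripe -c 8 ./SAVE

Note that striping settings only apply to files created after the change, so set them before running the calculation.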

Best regards,
Murali