BSE SLEPC - memory efficiency could be improved


malwi
Posts: 38
Joined: Mon Feb 29, 2016 1:00 pm

BSE SLEPC - memory efficiency could be improved

Post by malwi » Fri Mar 28, 2025 10:17 am

Dear Yambo Developers,

The BSE SLEPc solver reads the large ndb.BS_PAR_Q1 database using all CPUs and takes a lot of memory (6 atoms with SOC need 400 GB).
Reading takes 1h-07min for the case I tried, while the calculation itself takes 3 min.
Would it be possible to rewrite this part so that the calculation is done on the fly?

Below is the time report.
Best regards,
Gosia

Code:

 [06] Timing Overview
 ====================

 Clock: global (MAX - min (if any spread is present) clocks)
            (...)
            io_WF :     13.1474s P30 (   607 calls,   0.022 msec avg) [MAX]      0.0208s P3 [min]
DIPOLE_transverse :     40.5606s P37 [MAX]      0.0319s P31 [min]
            (...)
          Dipoles :     42.3740s P2 [MAX]     42.3739s P42 [min]
     Slepc Solver :    208.9028s P1 [MAX]    208.8697s P41 [min]
            io_BS :      01h-07m P40 (109384 calls,   0.037 sec avg) [MAX]       52m-28s P4 (109384 calls,   0.029 sec avg) [min]
dr hab. Małgorzata Wierzbowska, Prof. IHPP PAS
Institute of High Pressure Physics Polish Academy of Sciences
Warsaw, Poland

Daniele Varsano
Posts: 4198
Joined: Tue Mar 17, 2009 2:23 pm
Contact:

Re: BSE SLEPC - memory efficiency could be improved

Post by Daniele Varsano » Fri Mar 28, 2025 10:23 am

Dear Gosia,

can you please also post the input and report file?
Some info, such as the size of the BSE matrix and the percentage of eigenvectors required, would be useful.

Best,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

andrea.ferretti
Posts: 214
Joined: Fri Jan 31, 2014 11:13 am

Re: BSE SLEPC - memory efficiency could be improved

Post by andrea.ferretti » Fri Mar 28, 2025 11:56 am

Dear Gosia,

besides Daniele's suggestion (which needs to be considered first),
once the input file is confirmed to be ok, you could consider checking this development branch:

Code:

https://github.com/yambo-code/yambo/tree/tech-ydiago
currently in a pull request; it contains a fix contributed by a user to improve the BSE solver
(memory footprint and solver setup).
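For reference, a minimal sketch of how one might fetch and build that branch from source (an assumed workflow; the configure options are placeholders that depend on your compilers and libraries):

Code:

# fetch the development branch (assumed workflow; adapt paths as needed)
git clone https://github.com/yambo-code/yambo.git
cd yambo
git checkout tech-ydiago
# configure options are machine-specific placeholders; reuse your usual ones
./configure [your usual configure options]
make yambo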

Not sure this helps, but several fixes are included.

take care
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it

Davide Sangalli
Posts: 640
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: BSE SLEPC - memory efficiency could be improved

Post by Davide Sangalli » Fri Mar 28, 2025 1:01 pm

Dear Gosia,
the issue is that the code is making too many calls to io_BS_PAR, possibly because the BSE kernel was split into many blocks.

You can switch off the I/O of the BSE kernel by adding this line to the input:

Code:

DBsIOoff="BS"
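For context, a minimal sketch of how this line sits in a BSE/SLEPc input (the solver line is just an illustrative placeholder for what is already in your input; note that with the kernel I/O switched off the kernel is not stored, so it cannot be reused for a restart):

Code:

BSSmod= "s"        # [BSS] solver modality already in your input (here SLEPc)
DBsIOoff= "BS"     # switch off I/O of the BSE kernel database (ndb.BS_PAR_Q1 etc.)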
Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

malwi
Posts: 38
Joined: Mon Feb 29, 2016 1:00 pm

Re: BSE SLEPC - memory efficiency could be improved

Post by malwi » Sat Apr 19, 2025 4:21 pm

Dear Daniele, Andrea and Davide,
Thank you very much for your prompt answers, and forgive our late reply. We set up a test suite at Cyfronet and on LUMI for the memory use in Yambo. I prepared cells with 6, 12, 48 and 96 atoms and 3 levels of accuracy according to k-points, bands and plane waves.
The long job was run by a colleague. My job with the same input on the Ares computer was much shorter. We tested it several times and got running times from 1h-07min down to 29 min, all of them using 48 CPUs on 1 node,
and the running time for 96 CPUs on 2 nodes was only 6 min. It seems to depend strongly on the number of users.
I continue testing the memory use (up to 96 atoms). The inputs for the above case are attached.

Buona Pasqua!
Gosia
dr hab. Małgorzata Wierzbowska, Prof. IHPP PAS
Institute of High Pressure Physics Polish Academy of Sciences
Warsaw, Poland

batman
Posts: 14
Joined: Sun Jun 20, 2021 2:04 pm

Re: BSE SLEPC - memory efficiency could be improved

Post by batman » Fri Apr 25, 2025 7:30 am

Dear Gosia,

Couple of things:

1) When dealing with large files, performing parallel I/O on HPCs requires manual intervention. This can have a significant impact on I/O time.
This is not related to Yambo or its underlying I/O libraries. Moreover, it depends on the specific filesystem your HPC uses.
For instance, according to https://docs.lumi-supercomputer.eu/stor ... ems/lumip/, the scratch storage is equipped with a Lustre file system.
In that case, you need to adjust some parameters (mainly the stripe count for Lustre, based on my experience; see the sketch after this list). Please refer to your HPC documentation.
For LUMI, a quick Google search led me to this: https://lumi-supercomputer.github.io/LU ... 08_Lustre/
Alternatively, you can contact your HPC administrator for help.
In any case, if you don’t need to store the kernel, since it is very fast in your case, you can follow Davide’s suggestion.

2) If you are already aware of (1), read on. As already mentioned, the io_BS routine in Yambo is called over 100K times.
This is exactly what the HDF5/NetCDF libraries advise users to avoid when performing large parallel writes.
However, this was a kind of compromise: writing frequently allows restarting BSE calculations, at the cost of some performance penalty. This penalty can become significant when writing large files in parallel. This is not a solution, but simply an explanation of why certain things are coded this way.
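Regarding the stripe count mentioned in (1), here is a minimal sketch of the kind of commands involved on a Lustre filesystem; the directory path and the stripe count are placeholders, so check your HPC documentation for the recommended values:

Code:

# check how files in the run directory are currently striped (placeholder path)
lfs getstripe /scratch/project_XXXX/yambo_run
# stripe files newly created in this directory across 8 OSTs (illustrative value)
lfs setstripe -c 8 /scratch/project_XXXX/yambo_run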

Best regards,
Murali
Muralidhar Nalabothula
Doctoral student at Department of Physics and Materials Science,
Université du Luxembourg
