eval_G_minus_G in G0W0 calculation
Posted: Sun Sep 08, 2024 3:56 pm
Dear Yambo developers and user community!
I have recently started using Yambo, specifically the pre-compiled GPU version 5.1 on the Leonardo cluster, in order to compute the G0W0 quasiparticle levels for an interface of an organic molecule with a transition-metal dichalcogenide monolayer (108 atoms, non-collinear spin, SOC, standard fully relativistic PseudoDojo pseudopotentials).
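For context, this is a standard G0W0 calculation in the plasmon-pole approximation; schematically, the GW part of the input has the form below (the numerical values here are only placeholders, not the settings I actually use, which can be found in the attached input file):
Code:
gw0                          # [R] GW approximation
ppa                          # [R][Xp] Plasmon-pole approximation for the screened interaction
dyson                        # [R] Dyson equation solver
HF_and_locXC                 # [R] Hartree-Fock self-energy and local XC
em1d                         # [R][X] Dynamically screened interaction
el_el_corr                   # [R] (the keyword I ask about below)
EXXRLvcs= 40             Ry  # [XX] Exchange RL components
VXCRLvcs= 40             Ry  # [XC] XC-potential RL components
Chimod= "HARTREE"            # [X] Response-function approximation
% BndsRnXp
   1 | 800 |                 # [Xp] Polarization function bands
%
NGsBlkXp= 2              Ry  # [Xp] Response block size
% GbndRnge
   1 | 800 |                 # [GW] G[W] bands range
%
DysSolver= "n"               # [GW] Dyson equation solver ("n","s","g")
% QPkrange                   # [GW] QP generalized k-point/band indices
 1|1|400|410|
%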
In contrast to my previous experience with GW in a plane-wave basis, as well as to published benchmarks for Yambo on systems of similar size, the majority of the execution time is spent in the quasiparticle (QP) part rather than in computing and inverting the screening matrix, as can be seen in the following timing overview for a test calculation with very coarse parameters.
My question to the community is: what exactly does “eval_G_minus_G” do, is it expected to dominate the run time like this (roughly two and a half hours, compared to a few minutes for computing and inverting the screening), and what can I do to reduce this time?
Code:
Clock: global (MAX - min (if any spread is present) clocks)
io_KB_abinit : 0.0002s P11 [MAX] 0.0000s P16 [min]
GW(REDUX) : 1.1889s P6 [MAX] 0.0001s P2 [min]
io_ATMPROJ_pwscf : 0.0003s P6 [MAX] 0.0001s P24 [min]
PP_uspp_init : 0.0008s P4 ( 4 calls, 0.212 msec avg) [MAX] 0.0001s P14 ( 4 calls, 0.033 msec avg) [min]
io_Double_Grid : 0.0007s P15 [MAX] 0.0002s P21 [min]
HF(REDUX) : 2.5536s P4 [MAX] 0.0002s P13 [min]
IO_and_Messaging_switch : 0.0004s P23 (20 calls, 0.019 msec avg) [MAX] 0.0003s P9 (18 calls, 0.018 msec avg) [min]
scatter_Gamp_gpu : 0.0005s P18 ( 5 calls, 0.090 msec avg) [MAX] 0.0004s P9 ( 4 calls, 0.098 msec avg) [min]
io_COL_CUT : 0.0110s P1 ( 4 calls, 2.757 msec avg) [MAX] 0.0006s P3 ( 4 calls, 0.158 msec avg) [min]
io_HF : 0.0112s P1 ( 3 calls, 3.718 msec avg) [MAX] 0.0015s P9 ( 3 calls, 0.510 msec avg) [min]
io_QP_and_GF : 0.0173s P1
Coulomb Cutoff : 0.0321s P1 [MAX] 0.0219s P3 [min]
io_fragment : 1.0901s P1 (13 calls, 0.084 msec avg) [MAX] 0.0250s P2 ( 9 calls, 2.780 msec avg) [min]
io_DIPOLES : 0.1645s P13 (11 calls, 14.951 msec avg) [MAX] 0.0271s P4 (11 calls, 2.460 msec avg) [min]
io_QINDX : 0.0357s P2 ( 2 calls, 17.859 msec avg) [MAX] 0.0299s P23 ( 2 calls, 14.925 msec avg) [min]
io_GROT : 0.1022s P4 ( 2 calls, 51.094 msec avg) [MAX] 0.0976s P14 ( 2 calls, 48.785 msec avg) [min]
HF : 0.3195s P11 [MAX] 0.1618s P18 [min]
io_KB_pwscf : 0.5979s P18 ( 4 calls, 149.483 msec avg) [MAX] 0.3468s P3 ( 3 calls, 115.585 msec avg) [min]
FFT_setup : 0.8290s P4 ( 6 calls, 138.167 msec avg) [MAX] 0.7497s P18 ( 6 calls, 124.956 msec avg) [min]
io_X : 14.6568s P1 (15 calls, 0.977 sec avg) [MAX] 2.3222s P2 (15 calls, 0.155 sec avg) [min]
RIM : 2.9418s P1 [MAX] 2.9415s P8 [min]
io_WF : 3.9067s P24 (16 calls, 0.244 sec avg) [MAX] 3.0791s P7 (14 calls, 0.220 sec avg) [min]
WF_load_FFT : 5.6959s P24 ( 4 calls, 1.424 sec avg) [MAX] 4.5049s P7 ( 4 calls, 1.126 sec avg) [min]
SERIAL_lin_system_gpu : 6.5425s P2 ( 4 calls, 1.636 sec avg) [MAX] 6.5404s P1 ( 4 calls, 1.635 sec avg) [min]
LINEAR ALGEBRA : 6.5426s P2 ( 4 calls, 1.636 sec avg) [MAX] 6.5404s P1 ( 4 calls, 1.635 sec avg) [min]
MATRIX transfer (X_G_finite_q_X_redux_1_1) : 12.5667s P1 ( 4 calls, 3.142 sec avg) [MAX] 10.4240s P2 ( 4 calls, 2.606 sec avg) [min]
MATRIX transfer (X_G_finite_q_X_redux_2_1) : 12.5351s P1 ( 4 calls, 3.134 sec avg) [MAX] 10.4620s P2 ( 4 calls, 2.616 sec avg) [min]
MATRIX transfer (X_G_finite_q_X_redux_2_2) : 13.4594s P1 ( 4 calls, 3.365 sec avg) [MAX] 10.4910s P6 ( 4 calls, 2.623 sec avg) [min]
MATRIX transfer (X_G_finite_q_X_redux_1_2) : 13.5120s P2 ( 4 calls, 3.378 sec avg) [MAX] 10.5284s P5 ( 4 calls, 2.632 sec avg) [min]
XC_potential_driver : 11.2249s P15 ( 2 calls, 5.612 sec avg) [MAX] 11.1765s P10 ( 2 calls, 5.588 sec avg) [min]
XCo_local : 11.4011s P13 [MAX] 11.4010s P6 [min]
X (REDUX) : 50.0431s P16 ( 4 calls, 12.511 sec avg) [MAX] 31.8334s P1 ( 4 calls, 7.958 sec avg) [min]
X (procedure) : 51.4831s P1 ( 4 calls, 12.871 sec avg) [MAX] 34.7290s P16 ( 4 calls, 8.682 sec avg) [min]
Xo (REDUX) : 129.3615s P7 ( 4 calls, 32.340 sec avg) [MAX] 38.7388s P24 ( 4 calls, 9.685 sec avg) [min]
GW(ppa) : 65.4550s P2 [MAX] 63.5850s P24 [min]
Xo (procedure) : 311.1742s P24 ( 4 calls, 77.794 sec avg) [MAX] 219.1872s P7 ( 4 calls, 54.797 sec avg) [min]
DIPOLE_transverse : 299.1638s P5 [MAX] 287.2146s P16 [min]
Dipoles : 300.6506s P15 [MAX] 300.6505s P4 [min]
eval_G_minus_G : 02h-29m P20 [MAX] 02h-24m P6 [min]
Another, probably completely unrelated question: what exactly does the runlevel keyword “el_el_corr” do? I don’t think I have to set it explicitly, since the code seems to behave the same when it is left out, and unfortunately I cannot find any documentation for it on the website.
In addition, I also cannot get the “RIM_W” method to work for this system, but that is perhaps a topic for a separate question after some more testing...
Please find my Yambo input and log files in the attachment.