slow performance after a X0 is finished

Posted: Sat Feb 17, 2024 6:34 am
by xyf
Dear developers,

I am using yambo 5.1.1 to compute quasiparticle energies with the plasmon-pole approximation (PPA). In the dynamical dielectric matrix stage, once X0 for a given q-point is finished, yambo appears to stall. Only after several hours does it continue and calculate X, as in the log below:

Code:

 <04h-03m> P1-cnode399: Xo@q[2] |                                        | [000%] --(E) --(X)
 <04h-03m> P1-cnode399: [MEMORY] Alloc Xo_res( 136.6050 [Mb]) TOTAL:  10.39174 [Gb] (traced)  117.5200 [Mb] (memstat)
 <04h-05m> P1-cnode399: Xo@q[2] |#                                       | [002%] 02m-13s(E) 01h-28m(X)
 <04h-07m> P1-cnode399: Xo@q[2] |##                                      | [005%] 04m-25s(E) 01h-28m(X)
 <04h-09m> P1-cnode399: Xo@q[2] |###                                     | [007%] 06m-37s(E) 01h-28m(X)
 <04h-11m> P1-cnode399: Xo@q[2] |####                                    | [010%] 08m-49s(E) 01h-28m(X)
 <04h-14m> P1-cnode399: Xo@q[2] |#####                                   | [012%] 11m-01s(E) 01h-28m(X)
 <04h-16m> P1-cnode399: Xo@q[2] |######                                  | [015%] 13m-12s(E) 01h-28m(X)
 <04h-18m> P1-cnode399: Xo@q[2] |#######                                 | [017%] 15m-24s(E) 01h-28m(X)
 <04h-20m> P1-cnode399: Xo@q[2] |########                                | [020%] 17m-36s(E) 01h-28m(X)
 <04h-22m> P1-cnode399: Xo@q[2] |#########                               | [022%] 19m-48s(E) 01h-28m(X)
 <04h-25m> P1-cnode399: Xo@q[2] |##########                              | [025%] 21m-59s(E) 01h-27m(X)
 <04h-27m> P1-cnode399: Xo@q[2] |###########                             | [027%] 24m-11s(E) 01h-27m(X)
 <04h-29m> P1-cnode399: Xo@q[2] |############                            | [030%] 26m-22s(E) 01h-27m(X)
 <04h-31m> P1-cnode399: Xo@q[2] |#############                           | [032%] 28m-34s(E) 01h-27m(X)
 <04h-33m> P1-cnode399: Xo@q[2] |##############                          | [035%] 30m-47s(E) 01h-27m(X)
 <04h-36m> P1-cnode399: Xo@q[2] |###############                         | [037%] 32m-58s(E) 01h-27m(X)
 <04h-38m> P1-cnode399: Xo@q[2] |################                        | [040%] 35m-10s(E) 01h-27m(X)
 <04h-40m> P1-cnode399: Xo@q[2] |#################                       | [042%] 37m-23s(E) 01h-27m(X)
 <04h-42m> P1-cnode399: Xo@q[2] |##################                      | [045%] 39m-35s(E) 01h-27m(X)
 <04h-44m> P1-cnode399: Xo@q[2] |###################                     | [047%] 41m-46s(E) 01h-27m(X)
 <04h-47m> P1-cnode399: Xo@q[2] |####################                    | [050%] 43m-58s(E) 01h-27m(X)
 <04h-49m> P1-cnode399: Xo@q[2] |#####################                   | [052%] 46m-09s(E) 01h-27m(X)
 <04h-51m> P1-cnode399: Xo@q[2] |######################                  | [055%] 48m-20s(E) 01h-27m(X)
 <04h-53m> P1-cnode399: Xo@q[2] |#######################                 | [057%] 50m-32s(E) 01h-27m(X)
 <04h-55m> P1-cnode399: Xo@q[2] |########################                | [060%] 52m-44s(E) 01h-27m(X)
 <04h-57m> P1-cnode399: Xo@q[2] |#########################               | [062%] 54m-56s(E) 01h-27m(X)
 <05h-00m> P1-cnode399: Xo@q[2] |##########################              | [065%] 57m-08s(E) 01h-27m(X)
 <05h-02m> P1-cnode399: Xo@q[2] |###########################             | [067%] 59m-21s(E) 01h-27m(X)
 <05h-04m> P1-cnode399: Xo@q[2] |############################            | [070%] 01h-01m(E) 01h-27m(X)
 <05h-06m> P1-cnode399: Xo@q[2] |#############################           | [072%] 01h-03m(E) 01h-27m(X)
 <05h-08m> P1-cnode399: Xo@q[2] |##############################          | [075%] 01h-05m(E) 01h-27m(X)
 <05h-11m> P1-cnode399: Xo@q[2] |###############################         | [077%] 01h-08m(E) 01h-27m(X)
 <05h-13m> P1-cnode399: Xo@q[2] |################################        | [080%] 01h-10m(E) 01h-27m(X)
 <05h-15m> P1-cnode399: Xo@q[2] |#################################       | [082%] 01h-12m(E) 01h-28m(X)
 <05h-18m> P1-cnode399: Xo@q[2] |##################################      | [085%] 01h-15m(E) 01h-28m(X)
 <05h-20m> P1-cnode399: Xo@q[2] |###################################     | [087%] 01h-17m(E) 01h-28m(X)
 <05h-22m> P1-cnode399: Xo@q[2] |####################################    | [090%] 01h-19m(E) 01h-28m(X)
 <05h-26m> P1-cnode399: Xo@q[2] |#####################################   | [092%] 01h-23m(E) 01h-29m(X)
 <05h-29m> P1-cnode399: Xo@q[2] |######################################  | [095%] 01h-26m(E) 01h-31m(X)
 <05h-48m> P1-cnode399: Xo@q[2] |####################################### | [097%] 01h-45m(E) 01h-47m(X)
 <06h-21m> P1-cnode399: Xo@q[2] |########################################| [100%] 02h-18m(E) 02h-18m(X)
 <06h-21m> P1-cnode399: [MEMORY]  Free Xo_res( 136.6050 [Mb]) TOTAL:  10.26575 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-01m> P1-cnode399: [PARALLEL distribution for X Frequencies on 256 CPU] Loaded/Total (Percentual):1/2(50%)
 <12h-01m> P1-cnode399: X@q[2] |                                        | [000%] --(E) --(X)
 <12h-01m> P1-cnode399: [MEMORY] Alloc KERNEL%blc( 1.019056 [Gb]) TOTAL:  11.27420 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-01m> P1-cnode399: [MEMORY] Alloc Xo%blc( 1.019056 [Gb]) TOTAL:  12.29326 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-02m> P1-cnode399: [MEMORY] Alloc BUFFER%blc( 1.019056 [Gb]) TOTAL:  13.31231 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: X@q[2] |########################################| [100%] 02m-03s(E) 02m-03s(X)
 <12h-04m> P1-cnode399: [MEMORY]  Free M_par%blc( 1.019056 [Gb]) TOTAL:  12.29326 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [MEMORY]  Free M_par%blc( 1.019056 [Gb]) TOTAL:  11.27420 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [MEMORY]  Free M_par%blc( 273.2110 [Mb]) TOTAL:  11.00099 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [MEMORY] Alloc X_par%blc( 509.4830 [Mb]) TOTAL:  11.51047 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [PARALLEL distribution for RL vectors(X) on 4 CPU] Loaded/Total (Percentual):32606955/******(25%)
 <12h-04m> P1-cnode399: [MEMORY]  Free M_par%blc( 1.019056 [Gb]) TOTAL:  10.49142 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [MEMORY]  Free M_par%blc( 509.4830 [Mb]) TOTAL:  9.979949 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [MEMORY]  Free X_par_lower_triangle%blc( 273.2110 [Mb]) TOTAL:  9.706738 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [MEMORY] Alloc X_par_lower_triangle%blc( 273.2110 [Mb]) TOTAL:  9.979949 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [MEMORY] Alloc X_par%blc( 273.2110 [Mb]) TOTAL:  10.25316 [Gb] (traced)  117.5200 [Mb] (memstat)
 <12h-04m> P1-cnode399: [PARALLEL distribution for RL vectors(X) on 4 CPU] Loaded/Total (Percentual):17485551/******(13%)
 <12h-04m> P1-cnode399: [X-CG] R(p) Tot o/o(of R):   10998   81000     100
 <12h-04m> P1-cnode399: Xo@q[3] |                                        | [000%] --(E) --(X)
Here X0 for q2 takes about 2 hours, but the run then stalls for about 6 hours (06h-21m to 12h-01m) before X for q2 starts. This seems strange. What is yambo doing during those 6 hours, and is there anything I can do to improve the performance? Thank you very much.
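
For anyone who wants to locate such stalls in a long log automatically, here is a minimal Python sketch. It assumes every relevant line starts with an elapsed-time stamp of the form `<HHh-MMm>`, as in the excerpt above; other stamp formats are not handled. The function names `elapsed_minutes` and `largest_gap` are illustrative helpers, not part of yambo, and the sample lines are abbreviated copies of the log above.

```python
import re

# Elapsed-time stamp at the start of a log line, e.g. "<04h-03m>".
# Assumption: only the hours-minutes form seen in this thread is handled.
STAMP = re.compile(r"^\s*<(\d+)h-(\d+)m>")


def elapsed_minutes(line):
    """Return the elapsed time of a log line in minutes, or None if unstamped."""
    match = STAMP.match(line)
    if match is None:
        return None
    return int(match.group(1)) * 60 + int(match.group(2))


def largest_gap(lines):
    """Return (gap_in_minutes, line_before, line_after) for the longest silent interval."""
    best = (0, None, None)
    prev_time, prev_line = None, None
    for line in lines:
        now = elapsed_minutes(line)
        if now is None:
            continue  # skip lines without a stamp
        if prev_time is not None and now - prev_time > best[0]:
            best = (now - prev_time, prev_line, line)
        prev_time, prev_line = now, line
    return best


# Abbreviated lines taken from the log excerpt above.
sample = [
    " <05h-48m> P1-cnode399: Xo@q[2] [097%] 01h-45m(E) 01h-47m(X)",
    " <06h-21m> P1-cnode399: Xo@q[2] [100%] 02h-18m(E) 02h-18m(X)",
    " <06h-21m> P1-cnode399: [MEMORY]  Free Xo_res( 136.6050 [Mb])",
    " <12h-01m> P1-cnode399: [PARALLEL distribution for X Frequencies on 256 CPU]",
    " <12h-01m> P1-cnode399: X@q[2] [000%] --(E) --(X)",
]

gap, before, after = largest_gap(sample)
print(f"longest silent interval: {gap // 60}h {gap % 60}m")
print("last line before:", before.strip())
print("first line after:", after.strip())
```

On these lines the longest silent interval is the one between the `Free Xo_res` message and the start of the X frequency distribution, i.e. 5 h 40 min.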

(I am using 512 cores for a spin-polarized system with 110 electrons, a 10 Ry cutoff for epsilon, and 300 bands in the summation.)

Best,
Yuanfan Xiong

Re: slow performance after a X0 is finished

Posted: Sun Feb 18, 2024 9:38 am
by Daniele Varsano
Dear Yuanfan,

Please sign your posts with your name and affiliation; this is a rule of the forum. You can do this once and for all by filling in the signature in your user profile.

It is possible that there is some imbalance in the parallel structure. Unfortunately, I cannot see the task distribution from the log snippet. If you post your input file or the entire log file, we will have a look and can suggest how to tune the parallel strategy to avoid such an imbalance.

Best,
Daniele

Re: slow performance after a X0 is finished

Posted: Sun Feb 18, 2024 10:52 am
by xyf
Dear Daniele,

Thank you for your reply. I have now added my name and affiliation. The input file and logs are attached:
input.txt
l-gw_HF_and_locXC_gw0_dyson_rim_cut_em1d_ppa_CPU_1.txt
l-gw_HF_and_locXC_gw0_dyson_rim_cut_em1d_ppa_CPU_129.txt
Best,
Yuanfan Xiong

Re: slow performance after a X0 is finished

Posted: Mon Feb 19, 2024 8:21 am
by Daniele Varsano
Dear Yuanfan,

You can try setting the parallel strategy manually to improve load balance and memory distribution by adding the following to your input file:

Code:

X_and_IO_CPU= "1 1 1 32 16"                 # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"               # [PARALLEL] CPUs roles (q,g,k,c,v)
I also strongly suggest updating to a more recent version of the code (5.2).
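
As a quick sanity check on the role assignment suggested above, the product of the entries in `X_and_IO_CPU` is expected to match the number of MPI tasks of the run. The Python sketch below pairs each role with its CPU count and compares the product with 512. It assumes one MPI task per core and no OpenMP threads, which the thread does not state explicitly; `parse_roles` is an illustrative helper, not a yambo utility.

```python
from math import prod


def parse_roles(cpu_string, role_string):
    """Pair each parallel role with its CPU count, e.g. {'q': 1, ..., 'v': 16}."""
    cpus = [int(item) for item in cpu_string.split()]
    roles = role_string.split()
    if len(cpus) != len(roles):
        raise ValueError("X_and_IO_CPU and X_and_IO_ROLEs must have the same number of entries")
    return dict(zip(roles, cpus))


# Values copied from the suggestion above.
layout = parse_roles("1 1 1 32 16", "q g k c v")
total = prod(layout.values())

# Assumption: 512 cores means 512 MPI tasks (one task per core, no threads).
mpi_tasks = 512

print(layout)
print(f"product of role CPUs: {total}, matches MPI tasks: {total == mpi_tasks}")
```

With these values all 512 tasks are distributed over the conduction (`c`) and valence (`v`) bands, 32 x 16, and none over q-points, G-vectors or k-points.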

Best,
Daniele

Re: slow performance after a X0 is finished

Posted: Wed Feb 21, 2024 2:52 am
by xyf
Dear Daniele,

Thank you very much. I'll try the new version and the parallel strategy.

Best,
Yuanfan Xiong