ypp is very slow to calculate avehole

Anything regarding the post-processing utility (e.g. excitonic wavefunction analysis) is dealt with in this forum.

Moderators: Davide Sangalli, andrea marini, Daniele Varsano

young
Posts: 22
Joined: Wed Apr 03, 2019 3:50 pm

ypp is very slow to calculate avehole

Post by young » Wed Jun 08, 2022 7:55 am

Dear developer,

I am trying to use ypp (version 5.0.4) to calculate the average hole and electron densities, and I find the calculation is very slow even when I use several hundred cores. The post-processing seems to be slower than the GW and BSE calculations themselves. Is there any way to accelerate it, or how should I set up the parallelization parameters? Thanks very much!

Input for avehole
#############################
ElecTemp= 0.258520E-4 eV # Electronic Temperature
BoseTemp=-1.000000 eV # Bosonic Temperature
StdoHash= 40 # [IO] Live-timing Hashes
excitons # [R] Excitonic properties
avehole # [R] Average hole/electron wavefunction
infver # [R] Input file variables verbosity
wavefunction # [R] Wavefunction
Format= "c" # Output format [(c)ube/(g)nuplot/(x)crysden]
Direction= "123" # [rlu] [1/2/3] for 1d or [12/13/23] for 2d [123] for 3D
FFTGvecs= 28799 RL # [FFT] Plane-waves
States= "0 - 1" # Index of the BS state(s)
En_treshold= 0.000000 eV # Select states below this energy treshold
Res_treshold= 0.000000 # Select states above this optical strength treshold (max normalized to 1.)
BSQindex= 1 # Q-Index of the BS state(s)
Degen_Step= 0.010000 eV # Maximum energy separation of two degenerate states
Weight_treshold= 0.050000 # Print transitions above this weight treshold (max normalized to 1.)
WFMult= 1.000000 # Multiplication factor to the excitonic wavefunction
EHdensity= "h" # Calculate (h)ole/(e)lectron density from BSE wave-function
#############################


Output
###########################
<---> P1: [01] MPI/OPENMP structure, Files & I/O Directories
<---> P1: MPI Cores-Threads : 40(CPU)-4(threads)
<---> P1: MPI assigned to GPU : 0
<---> P1: [02] Y(ambo) P(ost)/(re) P(rocessor)
<---> P1: [03] Core DB
<---> P1: :: Electrons : 190.0000
<---> P1: :: Temperature : 0.950044E-3 [eV]
<---> P1: :: Lattice factors : 12.70797 11.00542 23.02844 [a.u.]
<---> P1: :: K points : 62
<---> P1: :: Bands : 1600
<---> P1: :: Symmetries : 12
<---> P1: :: RL vectors : 202433
<---> P1: [04] K-point grid
<---> P1: :: Q-points (IBZ): 62
<---> P1: :: X K-points (IBZ): 62
<---> P1: [05] CORE Variables Setup
<---> P1: [05.01] Unit cells
<---> P1: [05.02] Symmetries
<---> P1: [05.03] Reciprocal space
<---> P1: [05.04] K-grid lattice
<---> P1: Grid dimensions : 6 9 9
<---> P1: [05.05] Energies & Occupations
<02s> P1: [06] Excitonic Properties @ Q-index #1
<02s> P1: Sorting energies
<02s> P1: 2 excitonic states selected
<04s> P1: [06.01] Excitonic Wave Function
<04s> P1: [06.01.01] Real-Space grid setup
<04s> P1: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):744/744(100%)
<04s> P1: [FFT-EXCWF] Mesh size: 33 33 60
<08s> P1: [WF] Copying WF data from GPU device
<09s> P1: Extended grid : 33 33 60
<09s> P1: Processing 2 states
<09s> P1: State 1 Merged with states 1 -> 2
<09s> P1: ExcWF@1 | | [000%] --(E) --(X)
<32m-57s> P1: ExcWF@1 |# | [002%] 32m-48s(E) 21h-47m(X)
<01h-05m> P1: ExcWF@1 |## | [005%] 01h-05m(E) 21h-46m(X)
<01h-38m> P1: ExcWF@1 |### | [007%] 01h-38m(E) 21h-45m(X)
<02h-11m> P1: ExcWF@1 |#### | [010%] 02h-11m(E) 21h-45m(X)
<02h-43m> P1: ExcWF@1 |##### | [012%] 02h-43m(E) 21h-45m(X)
<03h-16m> P1: ExcWF@1 |###### | [015%] 03h-16m(E) 21h-45m(X)
<03h-48m> P1: ExcWF@1 |####### | [017%] 03h-48m(E) 21h-45m(X)
<04h-21m> P1: ExcWF@1 |######## | [020%] 04h-21m(E) 21h-45m(X)
<04h-54m> P1: ExcWF@1 |######### | [022%] 04h-53m(E) 21h-45m(X)
<05h-26m> P1: ExcWF@1 |########## | [025%] 05h-26m(E) 21h-45m(X)

#####################

BSE input
########################
WRbsWF
ppa # [R Xp] Plasmon Pole Approximation
rim_cut # [R RIM CUT] Coulomb potential
optics # [R OPT] Optics
bss # [R BSS] Bethe Salpeter Equation solver
em1d # [R Xd] Dynamical Inverse Dielectric Matrix
bse # [R BSE] Bethe Salpeter Equation.
bsk # [R BSK] Bethe Salpeter Equation kernel
K_Threads=0 # [OPENMP/BSK] Number of threads for response functions
#RandQpts= 1000000 # [RIM] Number of random q-points in the BZ
#RandGvec= 300 RL # [RIM] Coulomb interaction RS components
#CUTGeo= "box yz" # [CUT] Coulomb Cutoff geometry: box/cylinder/sphere/ws X/Y/Z/XY..
#% CUTBox
#0.00 | 56.00 | 28.00 | # [CUT] [au] Box sides
#%
#CUTRadius= 0.000000 # [CUT] [au] Sphere/Cylinder radius
#CUTCylLen= 0.000000 # [CUT] [au] Cylinder length
#CUTwsGvec= 0.700000 # [CUT] WS cutoff: number of G to be modified
Chimod= "hartree" # [X] IP/Hartree/ALDA/LRC/PF/BSfxc
BSEmod= "retarded" # [BSE] resonant/retarded/coupling
BSKmod= "SEX" # [BSE] IP/Hartree/HF/ALDA/SEX
BSSmod= "d" # [BSS] (h)aydock/(d)iagonalization/(i)nversion/(t)ddft`
XTermKind= "none" # [X] X terminator ("none","BG" Bruneval-Gonze)
XTermEn= 40 eV # [X] X terminator energy (only for kind="BG")
XfnQP_Wc_dos= 0.000000 eV # [EXTQP Xd] W dos pre-factor (conduction)
BSENGexx= 60 Ry # RL # [BSK] Exchange components
BSENGBlk= 8 Ry # RL # [BSK] Screened interaction block size

KfnQPdb= "E < ./ndb.QP" # [EXTQP BSK BSS] Database
KfnQP_N= 1 # [EXTQP BSK BSS] Interpolation neighbours
% KfnQP_E
0.0000000 | 1.000000 | 1.000000 | # [EXTQP BSK BSS] E parameters (c/v) eV|adim|adim
%

#WehCpl # [BSK] eh interaction included also in coupling
% BEnRange
-2.00000 | 10.00000 | eV # [BSS] Energy range
%
% BDmRange
0.030000 | 0.030000 | eV # [BSS] Damping range
%
BEnSteps=1000 # [BSS] Energy steps
% BLongDir
1.000000 | 1.000000 | 1.000000 | # [BSS] [cc] Electric Field
%
% BSEBands
185 | 196 | # [BSK] Bands range
%
DysSolver= "n" # [GW] Dyson Equation solver ("n","s","g")
% BndsRnXp
1 | 1580 | # [Xp] Polarization function bands
%
NGsBlkXp= 12 Ry # RL # [Xp] Response block size
% LongDrXp
1.000000 | 1.000000 | 1.000000 | # [Xp] [cc] Electric Field
%
PPAPntXp= 27.21138 eV # [Xp] PPA imaginary energy
####################################


Best regards
Ke
Ke Yang
PostDoc
The Hong Kong Polytechnic University, Hong Kong, China

Daniele Varsano
Posts: 3773
Joined: Tue Mar 17, 2009 2:23 pm

Re: ypp is very slow to calculate avehole

Post by Daniele Varsano » Wed Jun 08, 2022 8:43 am

Dear Ke,
This is rather strange.

Some considerations:

1) It seems you are running on a GPU machine; however, I am not 100% sure that this part of the post-processing is ported to exploit the GPU device.
2) You can safely reduce the FFTGvecs variable; this will speed up your calculation.
3) Please note that the wavefunction index (States) starts from 1.
4) Please note that two degenerate excited states were found; in this case, ypp considers a linear combination of the two. If you want to inspect them one by one, you need to reduce the Degen_Step value (e.g. to 0.0 eV). A sketch of the adjusted input lines follows below.
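
For instance (the exact FFTGvecs value below is only indicative and should be adapted to your system and to the accuracy you need), the relevant lines of the ypp input could be adjusted along these lines:

#############################
FFTGvecs= 8000 RL       # reduced from 28799 RL: fewer plane-waves gives a smaller FFT mesh and a faster run
States= "1 - 1"         # state indices start from 1
Degen_Step= 0.00000 eV  # 0 eV: do not merge degenerate states, so they can be inspected one by one
#############################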

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

young
Posts: 22
Joined: Wed Apr 03, 2019 3:50 pm

Re: ypp is very slow to calculate avehole

Post by young » Wed Jun 08, 2022 6:23 pm

Dear Daniele,
Thanks for your reply.

########################
1) It seems you are running on a GPU machine; however, I am not 100% sure that this part of the post-processing is ported to exploit the GPU device.
##########################

Indeed, I compiled Yambo with nvfortran. I tested two cases: one with ypp compiled using the "--enable-cuda" parameter, and another without it. I find that both are very slow. I would like to know whether we can use nvfortran at all?

#########################
2) You can safely reduce the FFTGvecs variable; this will speed up your calculation.
3) Please note that the wavefunction index (States) starts from 1.
4) Please note that two degenerate excited states were found; in this case, ypp considers a linear combination of the two. If you want to inspect them one by one, you need to reduce the Degen_Step value (e.g. to 0.0 eV).
#######################

I will follow your suggestions. Thanks so much.

Best regards
Ke
Ke Yang
PostDoc
The Hong Kong Polytechnic University, Hong Kong, China

young
Posts: 22
Joined: Wed Apr 03, 2019 3:50 pm

Re: ypp is very slow to calculate avehole

Post by young » Thu Jun 09, 2022 5:02 am

Dear Daniele,
Following your suggestions, I finished the calculation, but it still seems very slow. I used 80 CPUs and the run took about 5 hours; this system has only 22 atoms. The efficiency also differs between MPI tasks: for example, MPI task 1 needs about 2 hours, while MPI task 45 needs about 5 hours. If I use more CPUs, the efficiency does not improve. I also tried both the GPU and CPU versions compiled with nvfortran, and both are slow. For the GPU version, the data seem to be copied to the GPU, but when I checked the status of the CPU and GPU it appears that only the CPU is running, not the GPU. Could I set up the parallelization parameters differently, or is there any other way to improve the efficiency?

#############################
ElecTemp= 0.258520E-4 eV # Electronic Temperature
BoseTemp=-1.000000 eV # Bosonic Temperature
StdoHash= 40 # [IO] Live-timing Hashes
excitons # [R] Excitonic properties
avehole # [R] Average hole/electron wavefunction
infver # [R] Input file variables verbosity
wavefunction # [R] Wavefunction
Format= "c" # Output format [(c)ube/(g)nuplot/(x)crysden]
Direction= "123" # [rlu] [1/2/3] for 1d or [12/13/23] for 2d [123] for 3D
FFTGvecs= 8799 RL # before 28799 [FFT] Plane-waves
States= "1 - 1" # Index of the BS state(s)
En_treshold= 0.000000 eV # Select states below this energy treshold
Res_treshold= 0.000000 # Select states above this optical strength treshold (max normalized to 1.)
BSQindex= 1 # Q-Index of the BS state(s)
Degen_Step= 0.00000 eV # Maximum energy separation of two degenerate states
Weight_treshold= 0.050000 # Print transitions above this weight treshold (max normalized to 1.)
WFMult= 1.000000 # Multiplication factor to the excitonic wavefunction
EHdensity= "h" # Calculate (h)ole/(e)lectron density from BSE wave-function
#######################################



##################################################
<---> P45: [01] MPI/OPENMP structure, Files & I/O Directories
<---> P45-dcs057: MPI Cores-Threads : 80(CPU)-1(threads)
<---> P45-dcs057: [02] Y(ambo) P(ost)/(re) P(rocessor)
<---> P45-dcs057: [03] Core DB
<---> P45-dcs057: :: Electrons : 190.0000
<---> P45-dcs057: :: Temperature : 0.950044E-3 [eV]
<---> P45-dcs057: :: Lattice factors : 12.70797 11.00542 23.02844 [a.u.]
<---> P45-dcs057: :: K points : 62
<---> P45-dcs057: :: Bands : 1600
<---> P45-dcs057: :: Symmetries : 12
<---> P45-dcs057: :: RL vectors : 202433
<---> P45-dcs057: [04] K-point grid
<---> P45-dcs057: :: Q-points (IBZ): 62
<---> P45-dcs057: :: X K-points (IBZ): 62
<---> P45-dcs057: [05] CORE Variables Setup
<---> P45-dcs057: [05.01] Unit cells
<---> P45-dcs057: [05.02] Symmetries
<---> P45-dcs057: [05.03] Reciprocal space
<---> P45-dcs057: [05.04] K-grid lattice
<---> P45-dcs057: Grid dimensions : 6 9 9
<---> P45-dcs057: [05.05] Energies & Occupations
<04s> P45-dcs057: [06] Excitonic Properties @ Q-index #1
<04s> P45-dcs057: Sorting energies
<04s> P45-dcs057: 1 excitonic states selected
<05s> P45-dcs057: [06.01] Excitonic Wave Function
<05s> P45-dcs057: [06.01.01] Real-Space grid setup
<05s> P45-dcs057: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):744/744(100%)
<05s> P45-dcs057: [FFT-EXCWF] Mesh size: 24 24 39
<11s> P45-dcs057: Extended grid : 24 24 39
<14s> P45-dcs057: Processing 1 states
<14s> P45-dcs057: ExcWF@1 | | [000%] --(E) --(X)
<08m-20s> P45-dcs057: ExcWF@1 |# | [002%] 08m-05s(E) 04h-44m(X)
<15m-32s> P45-dcs057: ExcWF@1 |## | [005%] 15m-17s(E) 04h-46m(X)
<22m-44s> P45-dcs057: ExcWF@1 |### | [007%] 22m-29s(E) 04h-47m(X)
.......
...... (intermediate log lines omitted)
<04h-29m> P45-dcs057: ExcWF@1 |###################################### | [095%] 04h-29m(E) 04h-43m(X)
<04h-36m> P45-dcs057: ExcWF@1 |####################################### | [097%] 04h-36m(E) 04h-43m(X)
<04h-43m> P45-dcs057: ExcWF@1 |########################################| [100%] 04h-43m(E) 04h-43m(X)
<04h-43m> P45-dcs057: 3D Plot | | [000%] --(E) --(X)
<04h-43m> P45-dcs057: 3D Plot |########################################| [100%] --(E) --(X)
<04h-43m> P45-dcs057: [07] Timing Overview
<04h-43m> P45-dcs057: [TIMING] io_ATMPROJ_pwscf : 0.0001s CPU
<04h-43m> P45-dcs057: [TIMING] io_COL_CUT : 0.0002s CPU
<04h-43m> P45-dcs057: [TIMING] io_Double_Grid : 0.0004s CPU ( 2 calls, 0.200 msec avg)
<04h-43m> P45-dcs057: [TIMING] PP_uspp_init : 0.0004s CPU
<04h-43m> P45-dcs057: [TIMING] io_DIPOLES : 0.0128s CPU
<04h-43m> P45-dcs057: [TIMING] FFT_setup : 0.0215s CPU
<04h-43m> P45-dcs057: [TIMING] io_QINDX : 0.0341s CPU
<04h-43m> P45-dcs057: [TIMING] io_GROT : 0.0477s CPU ( 2 calls, 23.865 msec avg)
<04h-43m> P45-dcs057: [TIMING] io_fragment : 0.1590s CPU (62 calls, 2.565 msec avg)
<04h-43m> P45-dcs057: [TIMING] io_WF : 0.3456s CPU (65 calls, 5.316 msec avg)
<04h-43m> P45-dcs057: [TIMING] io_BSS_diago : 1.5415s CPU ( 4 calls, 0.385 sec avg)
<04h-43m> P45-dcs057: [TIMING] WF_load_FFT : 5.6155s CPU
<04h-43m> P45-dcs057: [08] Game Over & Game summary
<04h-43m> P45-dcs057: [TIMING] [Time-Profile]: 04h-43m
##################################################
################################
<---> P1: [01] MPI/OPENMP structure, Files & I/O Directories
<---> P1-dcs056: MPI Cores-Threads : 80(CPU)-1(threads)
<---> P1-dcs056: [02] Y(ambo) P(ost)/(re) P(rocessor)
<---> P1-dcs056: [03] Core DB
<---> P1-dcs056: :: Electrons : 190.0000
<---> P1-dcs056: :: Temperature : 0.950044E-3 [eV]
<---> P1-dcs056: :: Lattice factors : 12.70797 11.00542 23.02844 [a.u.]
<---> P1-dcs056: :: K points : 62
<---> P1-dcs056: :: Bands : 1600
<---> P1-dcs056: :: Symmetries : 12
<---> P1-dcs056: :: RL vectors : 202433
<---> P1-dcs056: [04] K-point grid
<---> P1-dcs056: :: Q-points (IBZ): 62
<---> P1-dcs056: :: X K-points (IBZ): 62
<---> P1-dcs056: [05] CORE Variables Setup
<---> P1-dcs056: [05.01] Unit cells
<---> P1-dcs056: [05.02] Symmetries
<---> P1-dcs056: [05.03] Reciprocal space
<---> P1-dcs056: [05.04] K-grid lattice
<---> P1-dcs056: Grid dimensions : 6 9 9
<---> P1-dcs056: [05.05] Energies & Occupations
<04s> P1-dcs056: [06] Excitonic Properties @ Q-index #1
<04s> P1-dcs056: Sorting energies
<04s> P1-dcs056: 1 excitonic states selected
<05s> P1-dcs056: [06.01] Excitonic Wave Function
<05s> P1-dcs056: [06.01.01] Real-Space grid setup
<05s> P1-dcs056: [PARALLEL distribution for Wave-Function states] Loaded/Total(Percentual):744/744(100%)
<05s> P1-dcs056: [FFT-EXCWF] Mesh size: 24 24 39
<11s> P1-dcs056: Extended grid : 24 24 39
<14s> P1-dcs056: Processing 1 states
<14s> P1-dcs056: ExcWF@1 | | [000%] --(E) --(X)
<03m-58s> P1-dcs056: ExcWF@1 |# | [002%] 03m-43s(E) 02h-11m(X)
<07m-14s> P1-dcs056: ExcWF@1 |## | [005%] 06m-59s(E) 02h-11m(X)
<10m-29s> P1-dcs056: ExcWF@1 |### | [007%] 10m-15s(E) 02h-11m(X)
.....
<02h-08m> P1-dcs056: ExcWF@1 |####################################### | [097%] 02h-07m(E) 02h-11m(X)
<02h-11m> P1-dcs056: ExcWF@1 |########################################| [100%] 02h-11m(E) 02h-11m(X)
<04h-43m> P1-dcs056: 3D Plot | | [000%] --(E) --(X)
<04h-43m> P1-dcs056: 3D Plot |########################################| [100%] --(E) --(X)
<04h-43m> P1-dcs056: [07] Timing Overview
<04h-43m> P1-dcs056: [TIMING] io_ATMPROJ_pwscf : 0.0001s CPU
<04h-43m> P1-dcs056: [TIMING] PP_uspp_init : 0.0001s CPU
<04h-43m> P1-dcs056: [TIMING] io_COL_CUT : 0.0002s CPU
<04h-43m> P1-dcs056: [TIMING] io_Double_Grid : 0.0006s CPU ( 2 calls, 0.284 msec avg)
<04h-43m> P1-dcs056: [TIMING] io_DIPOLES : 0.0127s CPU
<04h-43m> P1-dcs056: [TIMING] FFT_setup : 0.0235s CPU
<04h-43m> P1-dcs056: [TIMING] io_QINDX : 0.0286s CPU
<04h-43m> P1-dcs056: [TIMING] io_GROT : 0.0534s CPU ( 2 calls, 26.715 msec avg)
<04h-43m> P1-dcs056: [TIMING] io_fragment : 0.1562s CPU (62 calls, 2.519 msec avg)
<04h-43m> P1-dcs056: [TIMING] io_WF : 0.3054s CPU (65 calls, 4.698 msec avg)
<04h-43m> P1-dcs056: [TIMING] io_BSS_diago : 1.4951s CPU ( 4 calls, 0.374 sec avg)
<04h-43m> P1-dcs056: [TIMING] WF_load_FFT : 5.4984s CPU
<04h-43m> P1-dcs056: [08] Game Over & Game summary
<04h-43m> P1-dcs056: [TIMING] [Time-Profile]: 04h-43m
###############################


Best Regards
Ke
Ke Yang
PostDoc
The Hong Kong Polytechnic University, Hong Kong, China

Daniele Varsano
Posts: 3773
Joined: Tue Mar 17, 2009 2:23 pm

Re: ypp is very slow to calculate avehole

Post by Daniele Varsano » Thu Jun 09, 2022 11:54 am

Dear Ke,

So it seems that this part of the code is not ported to GPU; for this reason, running on a GPU machine is a waste of resources. My suggestion, then, is to run on a CPU cluster using the Intel or gfortran compilers (an indicative build line is sketched below).

In the meantime, we will investigate whether there are evident bottlenecks in this kind of calculation. The avehole option is a rather new feature and not widely used, so it is possible that its performance has not yet been tested properly.
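
For instance (the exact configure options are only indicative and depend on the Yambo version and on the compilers and libraries available on your cluster), a CPU-only build could be set up along these lines:

####################################
# indicative CPU-only build: no --enable-cuda option is passed to configure
./configure FC=ifort MPIFC=mpiifort --enable-open-mp
# or, with the GNU toolchain: ./configure FC=gfortran MPIFC=mpif90 --enable-open-mp
make yambo ypp
####################################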

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

young
Posts: 22
Joined: Wed Apr 03, 2019 3:50 pm

Re: ypp is very slow to calculate avehole

Post by young » Thu Jun 09, 2022 4:42 pm

Dear Daniele,

Thanks for your good suggestions. I will try the Intel version.

Best
Ke
Ke Yang
PostDoc
The Hong Kong Polytechnic University, Hong Kong, China
