Parallelism of BSE kernel

Post by **anhhv** » Sat Apr 02, 2022 4:00 pm

Dear YAMBO,

Can you provide me with some tips for BSE kernel parallelization? I set the following BSE parallel variables:

BS_CPU= "47 1 28" # [PARALLEL] CPUs for each role
BS_ROLEs= "k eh t" # [PARALLEL] CPUs roles (k,eh,t)
BS_nCPU_LinAlg_INV= 8 # [PARALLEL] CPUs for matrix inversion
BS_nCPU_LinAlg_DIAGO= 8 # [PARALLEL] CPUs for matrix diagonalization

but it doesn't work. Looking at the output file, these settings do not seem to have been taken into account.

Another question: which parallelization scheme distributes the memory? It seems that this calculation requires a lot of memory.

I have attached the input and output files. Thank you.

Post by **Daniele Varsano** » Sat Apr 02, 2022 4:19 pm

Dear Viet-Anh Ha,

please note that parallel linear algebra is not available unless the code is compiled with the ScaLAPACK libraries, so without them the following variables have no effect:

BS_nCPU_LinAlg_INV= 8 # [PARALLEL] CPUs for matrix inversion
BS_nCPU_LinAlg_DIAGO= 8 # [PARALLEL] CPUs for matrix diagonalization

In any case, these are needed for the diagonalization of the BSE matrix, whereas you are experiencing problems while building the kernel.

Why do you say that BS_CPU and BS_ROLEs are not taken into account? The parallel distribution of the kernel is reported in the log files.
You have a lot of k-points (1000 in the BZ), so you can easily end up with very large matrices. I strongly suggest reducing the number of bands in the BSE:

% BSEBands
  1 | 18 |                   # [BSK] Bands range
%

The first band is around 41 eV below the Fermi level, so it will not participate in the first excitations, i.e. in the low-energy part of the spectrum.
Also, the last bands (16, 17 and 18) are 40 eV above the Fermi energy. I would drastically reduce the range of bands in the BSE; this will greatly reduce the computational burden.
Best,
Daniele
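
To see why trimming the band range helps so much, here is a rough Python sketch of how the BSE matrix size scales. It assumes that the resonant block has dimension N_k × N_v × N_c and is stored in full. The 9/9 and 3/3 valence/conduction splits and the 8 bytes per element (single-precision complex) are illustrative assumptions, not values taken from the attached files.

```python
def bse_matrix_estimate(n_kpts, n_valence, n_conduction, bytes_per_element=8):
    """Return (dimension, memory in GB) for a fully stored resonant BSE block.

    bytes_per_element=8 assumes single-precision complex numbers;
    use 16 for a double-precision build.
    """
    dim = n_kpts * n_valence * n_conduction
    mem_gb = dim * dim * bytes_per_element / 1e9
    return dim, mem_gb


# 1000 k-points in the BZ comes from the thread; the band splits are hypothetical.
for n_v, n_c in [(9, 9), (3, 3)]:
    dim, mem_gb = bse_matrix_estimate(1000, n_v, n_c)
    print(f"Nv={n_v} Nc={n_c}: dimension {dim}, about {mem_gb:.2f} GB")
```

Because the memory grows with the square of N_v × N_c, cutting each of the two band counts by a factor of three shrinks the matrix by a factor of 81 in this estimate.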

Post by **anhhv** » Sat Apr 02, 2022 9:00 pm

Thank you Daniele.

I did not pay attention to the log files and did not realize that the parallelization had been applied. I only looked at the beginning of the output file, which contains the following text:

* CPU-Threads :1316(CPU)-1(threads)-1(threads@X)-1(threads@DIP)-1(threads@SE)-1(threads@RT)-1(threads@K)-1(threads@NL)
* MPI CPU : 1316
* THREADS (max): 1
* THREADS TOT(max): 1316
* I/O NODES : 1
* Fragmented WFs :yes

so I thought the parallelization was not successful.

Another question is why NLogCPUs does not work in this run. I set NLogCPUs = 2, but the number of log files written is still equal to the number of MPI tasks.
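
For reference, the header above is consistent with the requested layout: the product of the BS_CPU entries has to equal the number of MPI tasks, and 47 × 1 × 28 = 1316. A minimal Python sketch of this check follows; the helper name `check_cpu_layout` is hypothetical and not part of Yambo.

```python
from math import prod


def check_cpu_layout(cpu_string, roles_string, n_mpi_tasks):
    """Map each role to its CPU count and verify the product matches the MPI task count."""
    cpus = [int(x) for x in cpu_string.split()]
    # Roles appear in this thread separated by spaces or by commas; accept both.
    roles = roles_string.replace(",", " ").split()
    if len(cpus) != len(roles):
        raise ValueError(f"{len(cpus)} CPU entries for {len(roles)} roles")
    total = prod(cpus)
    if total != n_mpi_tasks:
        raise ValueError(f"product of CPUs is {total}, expected {n_mpi_tasks}")
    return dict(zip(roles, cpus))


# Values copied from the input and the output header in this thread.
print(check_cpu_layout("47 1 28", "k eh t", 1316))
```

If the product does not match the number of MPI tasks, the sketch raises an error instead of returning a layout.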

Post by **Daniele Varsano** » Sun Apr 03, 2022 5:58 pm

Dear Viet-Anh Ha,

> Another question is why NLogCPUs does not work in this run. I set NLogCPUs = 2, but the number of log files written is still equal to the number of MPI tasks.

That's strange. Are you sure that there are no logs left from previous runs? These are not deleted, and an incremental suffix (*_01 etc.) is used for new ones.
Finally, I can see you are using a rather old version of Yambo (GPL Version 4.4.0); I suggest you update to a newer version.
Best,
Daniele

Post by **anhhv** » Tue Apr 05, 2022 9:24 pm

I tried deleting the log files in the LOG directory but still see the same problem. Yes, I'll update to the newest version.

Post by **Guo_BIT** » Sat Nov 02, 2024 10:35 am

Dear Daniele,

We also encountered issues during our BSE calculations. The final output of our LOG file is as follows:

 <51m-31s> P1-n7: [LA] SERIAL linear algebra
 <51m-31s> P1-n7: [09.01.07] Diago Solver @q1
 <51m-31s> P1-n7: Folding BSE Kernel |                                        | [000%] --(E) --(X)
 <51m-31s> P1-n7: Folding BSE Kernel |########################################| [100%] --(E) --(X)

I believe there is a problem with how we specified our parallel parameters:

DIP_CPU= "1 24 3"                      # [PARALLEL] CPUs for each role
DIP_ROLEs= "k,c,v"                    # [PARALLEL] CPUs roles (k,c,v)
DIP_Threads=0                    # [OPENMP/X] Number of threads for dipoles
X_and_IO_CPU= "1 1 1 24 3"                 # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q,g,k,c,v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
X_and_IO_nCPU_LinAlg_INV=-1      # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)
X_Threads=0                      # [OPENMP/X] Number of threads for response functions
BS_CPU= "12 1 6"                       # [PARALLEL] CPUs for each role
BS_ROLEs= "k,eh,t"                     # [PARALLEL] CPUs roles (k,eh,t)
BS_nCPU_LinAlg_INV=-1            # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)
BS_nCPU_LinAlg_DIAGO=-1          # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)
K_Threads=0                      # [OPENMP/BSK] Number of threads for response functions

We would therefore appreciate your advice. In this calculation, there are a total of 600 k-points and 36 occupied bands.

Best wishes
Jingda Guo

Post by **Daniele Varsano** » Mon Nov 04, 2024 9:01 am

Dear Jingda,
this could be a memory issue, but we would need to inspect your input and report files.
In particular, if you compile the code with the flag --enable-memory-profile, the allocated memory is traced in the log files.
Can you check the dimension of your BSE matrix? It is reported in the report file.
In any case, you can try running the diagonalization step with fewer CPUs; this way each MPI task will have more memory available.

Best,
Daniele
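
The effect of placing fewer MPI tasks on each node can be sketched with simple arithmetic. In the Python snippet below, the 192 GB node memory and the task counts are hypothetical placeholders, not values from this thread; substitute the figures for your own cluster.

```python
def memory_per_task_gb(node_memory_gb, tasks_per_node):
    """Memory available to each MPI task if a node's memory is shared evenly."""
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be at least 1")
    return node_memory_gb / tasks_per_node


NODE_MEMORY_GB = 192.0  # hypothetical node size

for tasks in (24, 12, 6):
    share = memory_per_task_gb(NODE_MEMORY_GB, tasks)
    print(f"{tasks} tasks per node: {share:.1f} GB per task")
```

Comparing this share with the size of the BSE matrix reported in the report file gives a first indication of whether the diagonalization can fit in memory.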

Yambo Community Forum
