BSE diagonalization solver error
Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan
-
- Posts: 299
- Joined: Fri Apr 09, 2010 12:30 pm
Re: BSE diagonalization solver error
Dear Daniele,
I am running the version compiled with --enable-memory-profile, but there is no memory information in the log file, as attached. In my previous calculation the memory information was there. Does this mean the memory cannot be distributed?
I reran following your suggestion, removing the ScaLAPACK distribution and lowering the cores to 12, but it still does not work.
Thanks!
Shudong
- Daniele Varsano
- Posts: 4198
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: BSE diagonalization solver error
Hi Shudong,
It seems strange to me that you do not have information on memory usage. Try typing ./configure -h and check the exact keyword needed to compile with memory monitoring.
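For instance, assuming a standard autoconf-style configure script (the exact option name may differ between Yambo versions), the flag can be located with:

Code: Select all
./configure -h | grep -i memory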
Another piece of advice is to move all the parallelization onto the k points:
Code: Select all
BS_CPU= "N 1 1" # [PARALLEL] CPUs for each role
BS_ROLEs= "k eh t"
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 31
- Joined: Thu Feb 16, 2017 2:26 pm
- Location: Beijing
Re: BSE diagonalization solver error
Hi Shudong,
Finally, I solved the problem using a serial run. The memory used in the serial run is much smaller than the memory per core in the parallel run. Maybe the memory is not correctly distributed in the parallel run.
Best,
Xiaowei
Xiaowei Zhang
Ph.D. Student
ICQM, Peking University
Beijing, China
- Daniele Varsano
- Posts: 4198
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: BSE diagonalization solver error
Dear Shudong,
I have also noticed that you compiled Yambo in double precision; do you really need that? Having the BSE matrix in double precision does not help in terms of memory usage.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 299
- Joined: Fri Apr 09, 2010 12:30 pm
Re: BSE diagonalization solver error
Dear Daniele,
You are right, I used the double-precision version. For my system, with single-precision Yambo the number of G vectors after initialization is far smaller than in nscf.out; if I switch to the double-precision version it increases noticeably, but is still smaller than in nscf.out.
Thanks!
Shudong
-
- Posts: 299
- Joined: Fri Apr 09, 2010 12:30 pm
Re: BSE diagonalization solver error
Dear Daniele,
I reran it following your suggestion, moving all parallel options to k, but it failed as before. I monitored the memory usage and found that it did not reach the limit of the machine (1 TB) before the run stopped.
Attached is the log, which shows that the diagonalization stopped.
I noticed you mentioned earlier that one can perform a serial calculation without rerunning the BS matrix. My process is yambo -b, and then yambo -o b -k sex -y d. I think your suggestion is that if this stops but with the kernel OK, we can continue with yambo -y d as a serial calculation. But when I do this (yambo -y d), the kernel reruns again...
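For reference, the two-step workflow described here can be sketched as shell commands (the file names are illustrative placeholders; each flagged yambo call generates the corresponding input file, which is then edited and run with the standard -F option):

Code: Select all
# step 1: generate and run the static screening (yambo -b)
yambo -b -F screening.in
yambo -F screening.in

# step 2: generate and run the BSE kernel (SEX) + diagonalization solver
yambo -o b -k sex -y d -F bse.in
yambo -F bse.in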
Thanks!
Best
Shudong
-
- Posts: 299
- Joined: Fri Apr 09, 2010 12:30 pm
Re: BSE diagonalization solver error
Dear Xiaowei,

Xiaowei wrote:
Hi Shudong,
Finally, I solved the problem using a serial run. The memory used in the serial run is much smaller than the memory per core in the parallel run. Maybe the memory is not correctly distributed in the parallel run.
Best,
Xiaowei
Thanks for your suggestion. I am not sure whether the serial run will work, since it will take a long time. How did you do it?
Thanks!
SD
- Daniele Varsano
- Posts: 4198
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: BSE diagonalization solver error
Dear Shudong,
Shudong wrote:
found the memory did not reach the limit of the machine (1 TB) before the run stopped.

OK, it stops when it tries to allocate and the memory is not enough.
Shudong wrote:
I noticed you mentioned earlier that one can perform a serial calculation without rerunning the BS matrix. But when I do this (yambo -y d), the kernel reruns again...

Usually, when it reruns, you should find the reason why in the report, flagged as an (ERR). Anyway, in your case this happens because the structure of the BSE database is CPU dependent, so in order to read the BSE matrix you need the same number of CPUs you used to calculate it.
Considering that the building of the BSE matrix is not that slow, what I can suggest is to repeat the calculation using fewer CPUs and activating OMP threads; note that to do that you need to compile yambo with OMP support. In this case the relevant variable is:
Code: Select all
K_Threads
Finally please note this warning message in your report:
Code: Select all
[WARNING] n_eh_CPU > 1 in a system with symmetries and k-points is not efficient. Try distributing first "k" and "t"
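An input fragment following this hint might look like the following (the values are purely illustrative, here for 16 MPI tasks; the point is to distribute over "k" and "t" while keeping "eh" at 1):

Code: Select all
BS_CPU= "8 1 2"    # [PARALLEL] CPUs for each role: k first, then t
BS_ROLEs= "k eh t" # [PARALLEL] CPUs roles (k,eh,t)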
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 299
- Joined: Fri Apr 09, 2010 12:30 pm
Re: BSE diagonalization solver error
Dear Daniele,
I ran with OMP, and my settings were:
PAR_def_mode= "memory" # [PARALLEL] Default distribution mode ("balanced"/"memory"/"workload")
BS_CPU= "1 2 4" # [PARALLEL] CPUs for each role
BS_ROLEs= "k eh t" # [PARALLEL] CPUs roles (k,eh,t)
BS_nCPU_LinAlg_INV= 4 # [PARALLEL] CPUs for Linear Algebra
BS_nCPU_LinAlg_DIAGO= 4 # [PARALLEL] CPUs for Linear Algebra
#X_Threads=2 # [OPENMP/X] Number of threads for response functions
#DIP_Threads=2 # [OPENMP/X] Number of threads for dipoles
K_Threads=8 # [OPENMP/BSK] Number of threads for response functions
but it stopped, with the log showing:
...
<13m-35s> P0002: Kernel |################## | [090%] 13m-28s(E) 14m-58s(X)
<14m-22s> P0002: Kernel |################### | [095%] 14m-15s(E) 15m-00s(X)
<15m-12s> P0002: Kernel |####################| [100%] 15m-06s(E) 15m-06s(X)
<15m-32s> P0002: [06] BSE solver(s)
<15m-32s> P0002: [LA@Response_T_space] PARALLEL linear algebra uses a 2x2 SLK grid (4 cpu)
<15m-32s> P0002: [06.01] Diago solver
<15m-32s> P0002: [MEMORY] Alloc BS_mat( 21.23366Gb) TOTAL: 23.20335Gb (traced)
<16m-48s> P0002: BSK diagonalize | | [000%] --(E) --(X)
<16m-48s> P0002: [MEMORY] Alloc M_loc%blc( 5.308416Gb) TOTAL: 27.21647Gb (traced)
<19m-49s> P0002: [MEMORY] Alloc V_loc%blc( 5.308416Gb) TOTAL: 32.52488Gb (traced) 360.6920Mb (memstat)
and in the report file the error is:
[06.01] Diago solver
====================
[WARNING]Allocation attempt of WS%v_cmplx of zero size.
If I increase BS_nCPU_LinAlg_INV and BS_nCPU_LinAlg_DIAGO from 4 to 8, the calculation becomes very slow...
Thanks!
Best
Shudong
- Daniele Varsano
- Posts: 4198
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: BSE diagonalization solver error
Dear Shudong,
as I told you before, I would skip the ScaLAPACK parallelization and move all the CPUs to "k".
Try to see whether, even using just a few CPUs, the BSE matrix is built in a reasonable time.
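Concretely, an input along these lines might do it (illustrative values for 16 MPI tasks; setting the linear-algebra CPUs to 1 should fall back to the serial LAPACK diagonalization instead of the ScaLAPACK one, though check this against your Yambo version):

Code: Select all
BS_CPU= "16 1 1"           # all MPI tasks on k
BS_ROLEs= "k eh t"
BS_nCPU_LinAlg_DIAGO= 1    # avoid the ScaLAPACK-distributed diagonalization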
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/