
CRASH of BSE calculation in parallel?

Posted: Tue Oct 27, 2020 2:22 am
by Dean
Dear all,
I am running a BSE calculation in parallel, but it crashed without any clear hint. I have tried using more nodes to ensure enough memory and changing the BSE-related variables BS_CPU / BS_ROLEs, but it didn't work. The input and output files are attached.
Any suggestion would be appreciated.

Re: CRASH of BSE calculation in parallel?

Posted: Tue Oct 27, 2020 6:19 pm
by Davide Sangalli
Dr. Yimin Ding,
it is very hard to identify the cause from just the input file and the report.

Anyway, it seems that your run correctly performs the "perturbative inversion" part for 142 of the frequency points on your frequency axis and then crashes when trying the "full inversion" for the remaining 259.
The total (142 + 259 = 401) is what you set in the input:

Code:

BEnSteps=401                     # [BSS] Energy steps
The first thing you can try is to set in the input (the default is "pf")

Code:

BSSInvMode="p"  
and play with the following variable (0.01 is the default; in my experience negative values do not work properly)

Code:

BSEPSInvTrs=0.01   '[BSS EPS] Inversion treshold. Relative[o/o](>0)/Absolute(<0)'
The perturbative-only mode will likely not solve all the frequencies, but at least you can get the solution for some of them.
For the others you will get zeros. Playing with (i.e. increasing) the smearing may also help.
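
To see which frequency points a perturbative-only run left unsolved, one can scan the spectrum for rows that are exactly zero. The following is only a minimal Python sketch: it assumes the spectrum has already been read into (energy, Im eps, Re eps) tuples, and the function name and the sample numbers are hypothetical, not yambo output.

```python
def unsolved_points(spectrum, tol=0.0):
    """Return the energies whose imaginary and real parts are both zero
    (within tol), i.e. points the solver presumably did not fill in."""
    return [e for e, im, re in spectrum if abs(im) <= tol and abs(re) <= tol]


# Hypothetical data: (energy in eV, Im eps, Re eps)
sample = [
    (1.0, 0.12, 3.4),
    (1.1, 0.0, 0.0),
    (1.2, 0.30, 3.1),
    (1.3, 0.0, 0.0),
]

missing = unsolved_points(sample)
print(f"{len(missing)} of {len(sample)} points are zero: {missing}")
```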

The next step is to try the full inversion, and for that I'd advise checking whether it is a memory problem by first solving the BSE without the double grid in diago mode.
I think the full inversion may require the same amount of memory. Moreover, the memory in that step is not distributed and the inversion operation is serial, unless you use ScaLAPACK.
So you may prefer to do that last step in serial (loading all data from the previous step).
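
Since the memory in the full inversion is not distributed, a rough estimate of the dense matrix size shows whether it can fit on one process at all. This is a back-of-the-envelope Python sketch under stated assumptions: only the resonant block is counted (dimension Nk x Nv x Nc), each element is a double-precision complex number (16 bytes; use 8 for a single-precision build), and the k-point and band counts below are hypothetical, not taken from the attached input.

```python
def bse_dimension(n_kpoints, n_valence, n_conduction):
    """Dimension of the resonant BSE block (assumption: Nk * Nv * Nc)."""
    return n_kpoints * n_valence * n_conduction


def dense_matrix_gib(n_rows, bytes_per_element=16):
    """Memory in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_element / 1024**3


# Hypothetical example: widening the band range by one valence and one
# conduction band grows the matrix memory by (16/9)^2, roughly 3x.
for n_v, n_c in [(3, 3), (4, 4)]:
    n = bse_dimension(1000, n_v, n_c)
    print(f"Nv={n_v} Nc={n_c}: N={n}, about {dense_matrix_gib(n):.2f} GiB")
```

The quadratic growth with the number of transitions is why a slightly larger band range can turn a working run into one that runs out of memory.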

Best,
D.

Re: CRASH of BSE calculation in parallel?

Posted: Wed Oct 28, 2020 9:37 am
by Dean
Dear Davide,
First, thank you for your reply.
When I set a small BSEbands range (such as 37 | 49 |), yambo runs and produces correct output; please see the attached file.
But when I set a slightly larger BSEbands range (such as 35 | 50 |), it crashes, with solutions for some but not all frequencies.
I tried using many nodes to get enough RAM, but it did not work, and I guess the memory is not distributed across these nodes.
So, how can I determine whether the run is in serial or in parallel?
Any suggestion would be appreciated.

Re: CRASH of BSE calculation in parallel?

Posted: Mon Nov 02, 2020 9:24 am
by claudio
Dr. Yimin Ding,

in order to have more memory for the BSE inversion you can use two strategies:

1) increase the number of threads to 2, 3 or 4.
On some systems you have to set

OMP_NUM_THREADS=2 (or 3, 4)

2) compile yambo with ScaLAPACK and increase the number of processors used in linear algebra (4, 9, 16 or more).
To do this, add the flag -V par when you generate the input and set:

BS_nCPU_LinAlg_INV=4
BS_nCPU_LinAlg_DIAGO=4

for example.

You can combine these two strategies. For example, if you have 32 cores you can set 2 threads and 16 processors in linear algebra.
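
The arithmetic behind such a split can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, and it assumes (based on the 4, 9, 16 examples above) that the linear-algebra processor count should be a perfect square no larger than the number of MPI tasks.

```python
import math


def hybrid_layouts(total_cores, thread_options=(1, 2, 3, 4)):
    """List (mpi_tasks, threads) pairs that use exactly total_cores."""
    return [(total_cores // t, t) for t in thread_options if total_cores % t == 0]


def square_linalg_sizes(mpi_tasks):
    """Perfect squares from 4 upward that do not exceed mpi_tasks
    (assumption: the linear-algebra processors form a square grid)."""
    return [k * k for k in range(2, math.isqrt(mpi_tasks) + 1)]


# 32 cores, as in the example above
for tasks, threads in hybrid_layouts(32):
    print(f"{tasks} MPI tasks x {threads} threads, "
          f"linear algebra candidates: {square_linalg_sizes(tasks)}")
```

With 32 cores and 2 threads this gives 16 MPI tasks, for which 16 is the largest candidate, matching the example above.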

best
Claudio

Re: CRASH of BSE calculation in parallel?

Posted: Tue Nov 03, 2020 9:05 am
by Dean
Dear Claudio,
Thanks for your reply. I will try it.
Best,
Yimin Ding

Re: CRASH of BSE calculation in parallel?

Posted: Fri Dec 11, 2020 2:09 am
by Dean
Dear Claudio,
Following your suggestion, I compiled yambo with ScaLAPACK and set "BS_nCPU_LinAlg_INV , BS_nCPU_LinAlg_DIAGO" in the BSE calculations.
When I use two nodes (24 cores per node) and set "BS_nCPU_LinAlg_INV=4 , BS_nCPU_LinAlg_DIAGO=4", the job runs successfully.
But when I use more nodes (such as 3, 4, 5, 6, 7, 8) and try many values of "BS_nCPU_LinAlg_INV , BS_nCPU_LinAlg_DIAGO", the jobs always crash with no output.
As time went on, my frustration boiled over.
Any suggestion would be appreciated.

Re: CRASH of BSE calculation in parallel?

Posted: Fri Dec 11, 2020 10:51 am
by Daniele Varsano
Dear Yimin,
can you also show the input files?
My poor man's suggestion here is to reduce the number of CPUs and raise the number of threads; this will give you more memory per core inside the node. Maybe others who are experts in the inversion procedure can give you a better suggestion.

Best,
Daniele

Re: CRASH of BSE calculation in parallel?

Posted: Fri Dec 11, 2020 2:30 pm
by claudio
Dear Yimin,

my suggestions are:

1) try to double the number of nodes from 2 to 4 and at the same time double the number of threads

2) try other methods like Haydock; they are much better parallelized
and distributed in memory. With Haydock I solved BSE matrices up to a size of 200,000.

best
Claudio

Re: CRASH of BSE calculation in parallel?

Posted: Mon Dec 14, 2020 3:51 am
by Dean
Dear Claudio,
Thanks for your reply.
I will try this method: "1) try to double the number of nodes from 2 to 4 and at the same moment double the number of threads".
I want to use the double-grid method to speed up the dielectric constant calculations, so I have to use the inversion method.
Best,
Yimin Ding