Does error suggest computational limits?
Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano
- jmullen
- Posts: 29
- Joined: Wed Apr 01, 2009 6:29 pm
Does error suggest computational limits?
Hi,
I am running convergence studies again (still), and increasing the density of the k-point mesh has been the primary step toward converging the system. Now, I have increased it one more time and I get the following error:
<07s> [M 2.681 Gb] Alloc bare_qpg (2.476)_pmii_daemon(SIGCHLD): PE 252 exit signal Killed
The step I am at is the first yambo run (the setup) after the execution of p2y. I have moved my calculations from my poor, overtaxed 20-processor cluster (where it showed the same error) to a shiny new cluster using 256 processors. Before I go further and throw more processors at it, I would like some verification that this error does indeed suggest "too few processors" and, if not, what the cause is (or where I should start looking). The complete calculation of the GW corrections works fine for the system; the only change in the configuration is moving from a moderate k-point mesh to one denser by a factor of 2 (or so).
Cheers
Jeff Mullen
NCSU Physics
- myrta gruning
- Posts: 242
- Joined: Tue Mar 17, 2009 11:38 am
- Contact:
Re: Does error suggest computational limits?
Hallo,
What is the memory available per processor?
Yambo was trying to allocate about 2.7 Gb when the error occurred. That is what [M 2.681 Gb] Alloc means. The allocated variable was bare_qpg.
If the problem is memory, you may try to lower MaxGvecs (the maximum number of G-vectors planned to be used). To expose this variable, generate the input with yambo -i -V RL (or -V 1 in older Yambo revisions).
Of course, you would also need to check how lowering this number affects the accuracy of the calculation.
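For illustration, the relevant line in the generated setup input looks something like this (the value shown is only a placeholder, not a recommendation):

    $ yambo -i -V RL              # writes the setup input (yambo.in by default)
    $ grep MaxGvecs yambo.in
    MaxGvecs= 20000       RL      # [INI] Max number of G-vectors planned to use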
Regards,
Myrta
Dr Myrta Grüning
School of Mathematics and Physics
Queen's University Belfast - Northern Ireland
http://www.researcherid.com/rid/B-1515-2009
- andrea marini
- Posts: 325
- Joined: Mon Mar 16, 2009 4:27 pm
- Contact:
Re: Does error suggest computational limits?
jmullen wrote: The step I am at is the first yambo run (the setup) after the execution of p2y. I have moved my calculations from my poor, overtaxed 20-processor cluster (where it showed the same error) to a shiny new cluster using 256 processors. Before I go further and throw more processors at it, I would like some verification that this error does indeed suggest "too few processors" and, if not, what the cause is (or where I should start looking).

Dear Jeff, I would like to add a note to Myrta's remarks. Yambo is fully parallelized, but it is poorly memory-distributed. In practice this means that, except in a few cases like the BSE, if you increase the number of CPUs you go faster but use more or less the same amount of memory. So if you broke the 2 Gb limit (which, in my opinion, is just crazy on 64-bit machines) on a 2-CPU cluster, you will break it again on a 1000-CPU cluster. But of course you will go much faster.

I am working to remove this limit, but it is really, really hard: in contrast to DFT codes, Yambo must calculate non-local operators (like the Hartree-Fock self-energy) that require cross-scattering between k-points, which makes it very hard to distribute the information among the CPUs without overusing inter-CPU communication.
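To make the point concrete, here is a toy sketch (this is NOT Yambo source; "qpg" and its size are hypothetical stand-ins for bare_qpg): every rank allocates the full matrix, so the per-process footprint is the same whether you run on 2 or 1000 CPUs.

    ! Toy illustration of a replicated (non-distributed) data layout.
    ! NOT Yambo code: "qpg" and its size are hypothetical stand-ins.
    program replicated_memory
      use mpi
      implicit none
      integer :: ierr, rank, nranks
      integer, parameter :: ng = 13000          ! hypothetical G-vector count (~2.5 Gb)
      complex(kind=8), allocatable :: qpg(:,:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      ! Every rank holds the SAME full matrix: adding CPUs divides the work,
      ! not the memory, so the per-rank footprint does not shrink with nranks.
      allocate(qpg(ng, ng))
      if (rank == 0) print '(a,f6.2,a,i5,a)', 'per-rank footprint: ', 16.0d0*ng*ng/1024.0d0**3, ' Gb on each of ', nranks, ' ranks'
      deallocate(qpg)

      call MPI_Finalize(ierr)
    end program replicated_memory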
Andrea
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)
- jmullen
- Posts: 29
- Joined: Wed Apr 01, 2009 6:29 pm
Re: Does error suggest computational limits?
Thanks for the information,
I want to contribute one last piece of information. Earlier, I posted the error:
<07s> [M 2.681 Gb] Alloc bare_qpg (2.476)_pmii_daemon(SIGCHLD): PE 252 exit signal Killed
I dug through the cluster information and found that there is 16 GB per core. Since I am not clear on how memory is allocated in the code, I cannot really venture any meaningful statement about how this relates to the allocation failure. Perhaps there are compiler issues that I have stumbled across.
Anyway, I hope this may suggest something useful to you.
Cheers
Jeff Mullen
NCSU Physics
- myrta gruning
- Posts: 242
- Joined: Tue Mar 17, 2009 11:38 am
- Contact:
Re: Does error suggest computational limits?
Hallo
If it is 16 Gb per core, then the problem is not the allocation of that matrix alone. But maybe there are also other large matrices, and the problem may be their sum (are there other Alloc messages in the log?).
Maybe you can launch the job and follow it with the top shell command (an example is below). That way you can check how much memory is actually used, though since the error happens after just 7 seconds it may be difficult to "observe" the job running.
You can also see what happens when you reduce MaxGvecs (just to check whether the problem is indeed the allocation of the matrices).
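For example, from another shell on the same compute node (the process name here is just illustrative):

    watch -n 1 'ps -C yambo -o pid,rss,vsz,comm'

The RSS column shows the resident memory of each process, refreshed every second.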
Cheers,
m
Dr Myrta Grüning
School of Mathematics and Physics
Queen's University Belfast - Northern Ireland
http://www.researcherid.com/rid/B-1515-2009
- jmullen
- Posts: 29
- Joined: Wed Apr 01, 2009 6:29 pm
Re: Does error suggest computational limits?
Greetings
The error is related to the values of FFTGvecs and EXXRLvcs. I am increasing these values, along with the number of bands included in the calculation and the k-mesh density, in a systematic process to test for convergence. I am attempting to close the gap at the Dirac point in graphene (the existence of a gap obviously suggests a non-converged calculation). However, I have been unable to do so, because I hit a wall where the yambo calculation runs into the allocation problem before the gap closes.
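For what it's worth, the scan over EXXRLvcs looks something like this (the file names, values, and job tags here are only illustrative, not my actual setup):

    # Hypothetical convergence scan over EXXRLvcs; gw.in is a template input.
    for E in 5000 10000 20000 40000; do
      sed "s/^EXXRLvcs=.*/EXXRLvcs= ${E} RL/" gw.in > gw_${E}.in
      mpirun -np 256 yambo -F gw_${E}.in -J exx_${E}
    done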
Cheers
Jeff Mullen
NCSU Physics
- andrea marini
- Posts: 325
- Joined: Mon Mar 16, 2009 4:27 pm
- Contact:
Re: Does error suggest computational limits?
jmullen wrote: The error is related to the values of FFTGvecs and EXXRLvcs. I am increasing these values, along with the number of bands included in the calculation and the k-mesh density, in a systematic process to test for convergence. I am attempting to close the gap at the Dirac point in graphene (the existence of a gap obviously suggests a non-converged calculation). However, I have been unable to do so, because I hit a wall where the yambo calculation runs into the allocation problem before the gap closes.

Jeff, it is hard to say anything accurate without having a look at the details of the calculation. Do you mind posting the input, the log, and the output of the crashed calculation? We can try to infer from these which allocation is causing the problem.
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)
- Conor Hogan
- Posts: 111
- Joined: Tue Mar 17, 2009 12:17 pm
- Contact:
Re: Does error suggest computational limits?
Dear Jeff,
In the meantime, I suggest you run a small F90 program (or better, a small MPI program) that allocates, fills, and deallocates successively larger matrices; maybe there is a problem with large single-array sizes. In any case, it's useful to see how much RAM you really have access to.
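Something along these lines would do (a minimal sketch; the 0.5 Gb step size and the 32 Gb ceiling are arbitrary choices):

    ! Minimal memory probe: allocate, fill, and free successively larger
    ! real(8) arrays until an allocation fails. Step size is illustrative.
    program memtest
      implicit none
      integer(kind=8), parameter :: step = 67108864_8   ! 0.5 Gb of real(8) values
      integer(kind=8) :: n
      real(kind=8), allocatable :: a(:)
      integer :: ierr, i

      n = 0_8
      do i = 1, 64                                      ! probe up to 32 Gb
         n = n + step
         allocate(a(n), stat=ierr)
         if (ierr /= 0) then
            print '(a,f6.1,a)', 'allocation FAILED at ', 8.0*n/1024.0**3, ' Gb'
            exit
         end if
         a = 1.0d0                                      ! touch the pages for real
         print '(a,f6.1,a)', 'allocated and filled ', 8.0*n/1024.0**3, ' Gb'
         deallocate(a)
      end do
    end program memtest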
Conor
Dr. Conor Hogan
CNR-ISM, via Fosso del Cavaliere, 00133 Roma, Italy;
Department of Physics and European Theoretical Spectroscopy Facility (ETSF),
University of Rome "Tor Vergata".