Does error suggest computational limits?
Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano
- jmullen
- Posts: 29
- Joined: Wed Apr 01, 2009 6:29 pm
Does error suggest computational limits?
Hi,
I am running convergence studies again (still), and increasing the density of the k-point mesh has been the primary step toward converging the system. Now, I have increased it one more time and I get the following error:
<07s> [M 2.681 Gb] Alloc bare_qpg (2.476)_pmii_daemon(SIGCHLD): PE 252 exit signal Killed
The step I am at is the first yambo run (the setup) after the execution of p2y. I have moved my calculations from my poor, overtaxed 20-processor cluster (where it showed the same error) to a shiny new cluster using 256 processors. Before I go further and throw more processors at it, I would like some verification that this error does indeed suggest "too few processors" and, if not, what the cause is (or where I should start looking). The complete calculation of the GW corrections works fine for the system; the only change in the configuration is moving from a moderate k-point mesh to one denser by a factor of 2 (or so).
Cheers
Jeff Mullen
NCSU Physics
- myrta gruning
- Posts: 242
- Joined: Tue Mar 17, 2009 11:38 am
- Contact:
Re: Does error suggest computational limits?
Hallo,
What is the memory available per processor?
Yambo was trying to allocate about 2.7 Gb when the error occurred. That is what [M 2.681 Gb] Alloc means. The allocated variable was bare_qpg.
If the problem is memory, you may try to lower MaxGvecs (the maximum number of G-vectors planned to be used). To expose this variable, generate the input with yambo -i -V RL (or -V 1 in older Yambo revisions).
Of course, you would also need to check how lowering this number affects the accuracy of the calculation.
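For illustration, the relevant line in the generated setup input looks something like this (the value shown is only a placeholder, not a recommendation):

    $ yambo -i -V RL              # writes the setup input (yambo.in by default)
    $ grep MaxGvecs yambo.in
    MaxGvecs= 20000       RL      # [INI] Max number of G-vectors planned to use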
Regards,
Myrta
Dr Myrta Grüning
School of Mathematics and Physics
Queen's University Belfast - Northern Ireland
http://www.researcherid.com/rid/B-1515-2009
- andrea marini
- Posts: 325
- Joined: Mon Mar 16, 2009 4:27 pm
- Contact:
Re: Does error suggest computational limits?
jmullen wrote: The step I am at is the first yambo run (the setup) after the execution of p2y. I have moved my calculations from my poor, overtaxed 20-processor cluster (where it showed the same error) to a shiny new cluster using 256 processors. Before I go further and throw more processors at it, I would like some verification that this error does indeed suggest "too few processors" and, if not, what the cause is (or where I should start looking).

Dear Jeff, I would like to add a note to Myrta's remarks. Yambo is fully parallelized, but it is poorly memory-distributed. In practice this means that, except in a few cases like the BSE, if you increase the number of CPUs you go faster but use more or less the same amount of memory. So if you broke the 2 Gb limit (which, in my opinion, is just crazy on 64-bit machines) on a 2-CPU cluster, you will break it again on a 1000-CPU cluster. But of course you will go much faster.

I am working to remove this limit, but it is really, really hard: in contrast to DFT codes, Yambo must calculate non-local operators (like the Hartree-Fock self-energy) that require cross-scattering between k-points, which makes it very hard to distribute the information among the CPUs without overusing inter-CPU communication.
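To make the point concrete, here is a toy sketch (this is NOT Yambo source; "qpg" and its size are hypothetical stand-ins for bare_qpg): every rank allocates the full matrix, so the per-process footprint is the same whether you run on 2 or 1000 CPUs.

    ! Toy illustration of a replicated (non-distributed) data layout.
    ! NOT Yambo code: "qpg" and its size are hypothetical stand-ins.
    program replicated_memory
      use mpi
      implicit none
      integer :: ierr, rank, nranks
      integer, parameter :: ng = 13000          ! hypothetical G-vector count (~2.5 Gb)
      complex(kind=8), allocatable :: qpg(:,:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      ! Every rank holds the SAME full matrix: adding CPUs divides the work,
      ! not the memory, so the per-rank footprint does not shrink with nranks.
      allocate(qpg(ng, ng))
      if (rank == 0) print '(a,f6.2,a,i5,a)', 'per-rank footprint: ', 16.0d0*ng*ng/1024.0d0**3, ' Gb on each of ', nranks, ' ranks'
      deallocate(qpg)

      call MPI_Finalize(ierr)
    end program replicated_memory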
Andrea
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)
- jmullen
- Posts: 29
- Joined: Wed Apr 01, 2009 6:29 pm
Re: Does error suggest computational limits?
Thanks for the information,
I want to contribute one last piece of information. Earlier, I posted the error:
<07s> [M 2.681 Gb] Alloc bare_qpg (2.476)_pmii_daemon(SIGCHLD): PE 252 exit signal Killed
I dug through the cluster information and found that there is 16 GB per core. Since I am not clear on how memory is allocated in the code, I cannot really venture any meaningful statement about how this relates to the allocation failure. Perhaps there are compiler issues that I have stumbled across.
Anyway, I hope this may suggest something useful to you.
Cheers
Jeff Mullen
NCSU Physics
- myrta gruning
- Posts: 242
- Joined: Tue Mar 17, 2009 11:38 am
- Contact:
Re: Does error suggest computational limits?
Hallo
If it is 16 Gb per core, then the problem is not the allocation of that matrix alone. But maybe there are also other large matrices, and the problem may be their sum (are there other Alloc messages in the log?).
Maybe you can launch the job and follow it with the top shell command (an example is below). That way you can check how much memory is actually used, though since the error happens after just 7 seconds it may be difficult to "observe" the job running.
You can also see what happens when you reduce MaxGvecs (just to check whether the problem is indeed the allocation of the matrices).
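For example, from another shell on the same compute node (the process name here is just illustrative):

    watch -n 1 'ps -C yambo -o pid,rss,vsz,comm'

The RSS column shows the resident memory of each process, refreshed every second.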
Cheers,
m
Dr Myrta Grüning
School of Mathematics and Physics
Queen's University Belfast - Northern Ireland
http://www.researcherid.com/rid/B-1515-2009
- jmullen
- Posts: 29
- Joined: Wed Apr 01, 2009 6:29 pm
Re: Does error suggest computational limits?
Greetings
The error is related to the values of FFTGvecs and EXXRLvcs. I am increasing these values, along with the number of bands included in the calculation and the k-mesh density, in a systematic process to test for convergence. I am attempting to close the gap at the Dirac point in graphene (the existence of a gap obviously suggests a non-converged calculation). However, I have been unable to do so, because I hit a wall where the yambo calculation runs into the allocation problem before the gap closes.
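For what it's worth, the scan over EXXRLvcs looks something like this (the file names, values, and job tags here are only illustrative, not my actual setup):

    # Hypothetical convergence scan over EXXRLvcs; gw.in is a template input.
    for E in 5000 10000 20000 40000; do
      sed "s/^EXXRLvcs=.*/EXXRLvcs= ${E} RL/" gw.in > gw_${E}.in
      mpirun -np 256 yambo -F gw_${E}.in -J exx_${E}
    done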
Cheers
Jeff Mullen
NCSU Physics
- andrea marini
- Posts: 325
- Joined: Mon Mar 16, 2009 4:27 pm
- Contact:
Re: Does error suggest computational limits?
jmullen wrote: The error is related to the values of FFTGvecs and EXXRLvcs. I am increasing these values, along with the number of bands included in the calculation and the k-mesh density, in a systematic process to test for convergence. I am attempting to close the gap at the Dirac point in graphene (the existence of a gap obviously suggests a non-converged calculation). However, I have been unable to do so, because I hit a wall where the yambo calculation runs into the allocation problem before the gap closes.

Jeff, it is hard to say anything accurate without having a look at the details of the calculation. Do you mind posting the input, the log, and the output of the crashed calculation? We can try to infer from these which allocation is causing the problem.
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)
- Conor Hogan
- Posts: 111
- Joined: Tue Mar 17, 2009 12:17 pm
- Contact:
Re: Does error suggest computational limits?
Dear Jeff,
In the meantime, I suggest you run a small F90 program (or better, a small MPI program) that allocates, fills, and deallocates successively larger matrices; maybe there is a problem with large single-array sizes. In any case, it's useful to see how much RAM you really have access to.
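Something along these lines would do (a minimal sketch; the 0.5 Gb step size and the 32 Gb ceiling are arbitrary choices):

    ! Minimal memory probe: allocate, fill, and free successively larger
    ! real(8) arrays until an allocation fails. Step size is illustrative.
    program memtest
      implicit none
      integer(kind=8), parameter :: step = 67108864_8   ! 0.5 Gb of real(8) values
      integer(kind=8) :: n
      real(kind=8), allocatable :: a(:)
      integer :: ierr, i

      n = 0_8
      do i = 1, 64                                      ! probe up to 32 Gb
         n = n + step
         allocate(a(n), stat=ierr)
         if (ierr /= 0) then
            print '(a,f6.1,a)', 'allocation FAILED at ', 8.0*n/1024.0**3, ' Gb'
            exit
         end if
         a = 1.0d0                                      ! touch the pages for real
         print '(a,f6.1,a)', 'allocated and filled ', 8.0*n/1024.0**3, ' Gb'
         deallocate(a)
      end do
    end program memtest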
Conor
Dr. Conor Hogan
CNR-ISM, via Fosso del Cavaliere, 00133 Roma, Italy;
Department of Physics and European Theoretical Spectroscopy Facility (ETSF),
University of Rome "Tor Vergata".