Memory allocation error in parallel
Dear all,
I have done the tutorials and now want to apply them to my system, which has 80 atoms in the unit cell. After initializing, I wanted to calculate the HF correction, as in the Si tutorial. So I typed "yambo -x", reduced the number of G-vectors and launched yambo as
mpirun -np 8 yambo
After a few seconds, the code fails with the error message:
[ERROR] STOP signal received while in :[04] Bare local and non-local Exchange-Correlation
[ERROR]Mem All. failed. Element WF require 0.00000 [Gb]
I have read in the forum that this error could be related to very large numbers of G-vectors or bands. So I drastically reduced the number of G-vectors, the number of bands (down to just one) and the number of k-points, even to unrealistic values, but nothing changed. However, the code runs perfectly in serial.
Do you have any hints about what may be going on?
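For reference, the overall sequence of commands described above (and in point 1 below) is roughly the following. This is only a sketch: the initialization is assumed to be a plain yambo run, and the input-file edits are done by hand.

Code: Select all
# Sketch of the workflow described in this post
p2y -S -N            # build the yambo SAVE databases from the QE output (see point 1 below)
yambo                # initialization / setup run
yambo -x             # generate the input file for the Hartree-Fock run
# ... edit yambo.in here to reduce the number of G-vectors / bands / k-points ...
mpirun -np 8 yambo   # parallel run; this is the step that stops with the memory error above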
Some more data:
1) I got the databases from a fully converged Quantum Espresso calculation through "p2y -S -N". I do not think there is anything wrong with this, because the code works in serial execution.
2) I have a large number of G-vectors because my pseudopotentials needed a 200 Ry cutoff for convergence. The number of bands in the QE calculation is also large, but I reduced it in the yambo run.
3) Please find attached the file.zip archive containing the reports for the parallel and serial runs, the setup, and the LIST_log and l_dbs files.
4) The parallel version works fine in the tutorials.
Thank you very much in advance.
Juanjo
Juan J. Meléndez
Associate Professor
Department of Physics · University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Phone: +34 924 28 96 55
Fax: +34 924 28 96 51
Email: melendez@unex.es
Web: http://materiales.unex.es/miembros/pers ... Index.html
Re: Memory allocation error in parallel
Dear Juanjo,
Although the error message points to a memory issue, my feeling is that it is not one.
Can you try to run with, e.g., only 2 processors?
Could you also post the input file of the crashed run? I will have a look; if I cannot spot the problem, we will need to reproduce the error,
so I would next also ask for your scf/nscf input files and pseudopotentials.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Hi Daniele
Thank you for replying so quickly.
The test with 2 processors also failed. I am attaching the input file for the crashed run, the reports for 2 processors, the input files for the scf and bands calculations, and the pseudopotential files.
The scf calculation converges with a 4x4x4 k-point mesh, and I simply took the same mesh for the nscf run. For simplicity, I also used a single k-point and a single band in the yambo input file.
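For context, a minimal sketch of how the settings mentioned in this thread (200 Ry cutoff, 80 atoms, 4x4x4 mesh) would appear in the Quantum Espresso nscf/bands input. The prefix and nbnd values are placeholders, and the structure-specific variables and cards (ibrav, ntyp, CELL_PARAMETERS, ATOMIC_SPECIES, ATOMIC_POSITIONS) are omitted since they depend on the system.

Code: Select all
! Sketch only: just the settings discussed in this thread.
&control
   calculation = 'nscf'
   prefix      = 'mysystem'   ! placeholder prefix
/
&system
   nat     = 80        ! 80 atoms in the unit cell
   ecutwfc = 200.0     ! 200 Ry cutoff required by the pseudopotentials
   nbnd    = 400       ! placeholder for the "large" number of bands
/
&electrons
/
K_POINTS automatic
4 4 4 0 0 0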
Best,
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Thanks Juanjo,
it looks like a huge calculation; how long does the ground-state calculation take?
If it is too long, can you try to reduce the sampling and see whether the error persists? That would make debugging easier.
Otherwise, I will try to reproduce your error in the next days.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Hi
The scf+nscf takes about 5 hours with the 4x4x4 mesh. It is indeed relatively large, because the system contains 80 atoms and the pseudopotentials require large cutoffs.
Let me reduce the size of the model and try again. I will get back to you when done.
Thanks again
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Daniele, I reduced the size of the system and the calculation did not crash.
The problem therefore arises when running yambo on realistic scf outputs. In my case, I need something like a 200 Ry cutoff and a 4x4x4 MP mesh for convergence, and these are exactly the settings that produced the crash before. Is there any way to get rid of this error?
Thanks in advance
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Dear Juanjo,
I will try to reproduce your error and let you know. In the meantime, can you send me the l.* output of your serial calculation?
From there I can tell whether it is indeed a memory issue, as it now looks to be.
Best
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Hi Daniele
Now I cannot reproduce the serial calculation either, and I don't get much information:
____ ____ ___ .___ ___. .______ ______
\ \ / / / \ | \/ | | _ \ / __ \
\ \/ / / ^ \ | \ / | | |_) | | | | |
\_ _/ / /_\ \ | |\/| | | _ < | | | |
| | / _____ \ | | | | | |_) | | `--" |
|__| /__/ \__\ |__| |__| |______/ \______/
<---> [01] Files & I/O Directories
<---> [02] CORE Variables Setup
<---> [02.01] Unit cells
<---> [02.02] Symmetries
<---> [02.03] RL shells
<01s> [02.04] K-grid lattice
<01s> [02.05] Energies [ev] & Occupations
<01s> [03] Transferred momenta grid
<01s> [04] Bare local and non-local Exchange-Correlation
<04s> [M 5.599 Gb] Alloc WF ( 5.524)
<06s> [FFT-HF/Rho] Mesh size: 77 77 77
<06s> [WF-HF/Rho loader] Wfs (re)loading | | [000%] --(E) --(X)
<06s> [M 5.754 Gb] Alloc wf_disk ( 0.148)Killed
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Dear Juanjo,
OK, it probably is a memory problem. In your serial calculation I can see:

Code: Select all
[M 5.599 Gb] Alloc WF ( 5.524)

so you need more than 5 Gb to allocate the wavefunctions. The serial run can allocate them, but when you run with more processors there is not enough memory on the node.
Maybe you can have a look at the specifications of your machine (RAM per CPU), and possibly reserve an entire node in order to have more memory per CPU, or use more nodes.
Workarounds:
1) Try to run the code with the -M option, i.e.:

Code: Select all
mpirun -n 2 ./yambo -M

or with more than 2 processes; I would not use all the CPUs of the node.
2) You have an enormous number of G-vectors:

Code: Select all
G-vectors [RL space]: 1369167

but most probably you do not need all the G-vectors of the charge density, and the ones describing the wavefunctions are enough:

Code: Select all
WF G-vectors : 182950

So you can either delete your ./SAVE/ndb.gops database and redo the setup, adding to your input file the variable:

Code: Select all
MaxGvecs=182950

or run your HF calculation adding the FFTGvecs variable:

Code: Select all
HF_and_locXC # [R XX] Hartree-Fock Self-energy and Vxc
FFTGvecs=182950
EXXRLvcs= 135 RL # [XX] Exchange RL components
%QPkrange # [GW] QP generalized Kpoint/Band indices
1| 1|115|116|
%
%QPerange # [GW] QP generalized Kpoint/Energy indices
1| 1| 0.0|-1.0|
%

I would first try this last option.
Hope this can solve the problem.
Best,
Daniele
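For completeness, the MaxGvecs workaround would correspond roughly to the following sequence. This is only a sketch built from commands that appear elsewhere in this thread (yambo -i -V RL, yambo -x, mpirun), and it assumes the gops database sits at ./SAVE/ndb.gops.

Code: Select all
# Sketch of workaround 2 (reduced G-vector basis)
rm SAVE/ndb.gops          # remove the old G-vector database
yambo -i -V RL            # regenerate the setup input, exposing MaxGvecs
# edit the generated input and set:  MaxGvecs=182950
yambo                     # redo the setup with the reduced G-vector basis
yambo -x                  # regenerate the Hartree-Fock input
mpirun -np 2 yambo -M     # rerun HF in parallel with the -M option suggested above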
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Dear Daniele
It works! I used both of your suggestions and got an HF calculation for the system, so it is definitely a memory issue.
Just to make sure that I understand correctly, let me ask a couple more questions:
1) I ran "yambo -i -V RL" and set "MaxGvecs=182950", which is the maximum number of G-vectors used for the wavefunctions in my scf run, and then ran "yambo" for a setup with the limited set of G's (see the sketch below). I guess this limits the amount of memory required to allocate the G-dependent variables. Is this right?
2) Setting "FFTGvecs=182950" in "yambo.in" does the same thing, but does not reduce the amount of memory reserved for the G's, so in this second case my initialization would take longer and memory would be reserved for the whole original set of 1369167 G-vectors. Is this right?
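For reference, a sketch of what the regenerated setup input from point 1 might contain; the exact comments and any additional variables depend on the yambo version, so only the MaxGvecs value is taken from this thread.

Code: Select all
setup                       # initialization runlevel generated by "yambo -i -V RL"
MaxGvecs= 182950       RL   # reduced G-vector basis (wavefunction G-vectors only)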
Thank you very much. You saved my day today!
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es