Memory allocation error in parallel
Dear all,
I have done the tutorials and now want to apply them to my system, which has 80 atoms in the unit cell. After initializing, I wanted to calculate the HF correction, as in the Si tutorial. So I typed "yambo -x", reduced the number of G-vectors and launched yambo as
mpirun -np 8 yambo
After a few seconds, the code fails with the error message:
[ERROR] STOP signal received while in :[04] Bare local and non-local Exchange-Correlation
[ERROR]Mem All. failed. Element WF require 0.00000 [Gb]
I have read in the forum that this error could be related to very large numbers of G-vectors or bands. So I drastically reduced the number of G-vectors, the number of bands (down to just one) and the number of k-points, even to unrealistic values, but nothing changed. However, the code runs perfectly in serial.
Do you have any hints about what may be going on?
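For reference, the overall sequence of commands described above (and in point 1 below) is roughly the following. This is only a sketch: the initialization is assumed to be a plain yambo run, and the input-file edits are done by hand.

Code: Select all
# Sketch of the workflow described in this post
p2y -S -N            # build the yambo SAVE databases from the QE output (see point 1 below)
yambo                # initialization / setup run
yambo -x             # generate the input file for the Hartree-Fock run
# ... edit yambo.in here to reduce the number of G-vectors / bands / k-points ...
mpirun -np 8 yambo   # parallel run; this is the step that stops with the memory error above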
Some more data:
1) I got the databases from a fully converged Quantum Espresso calculation through "p2y -S -N". I do not think there is anything wrong with this, because the code works in serial execution.
2) I have a large number of G-vectors because my pseudopotentials needed a 200 Ry cutoff for convergence. The number of bands in the QE calculation is also large, but I reduced it in the yambo run.
3) Please find attached the file.zip archive containing the reports for the parallel and serial runs, the setup, and the LIST_log and l_dbs files.
4) The parallel version works fine in the tutorials.
Thank you very much in advance.
Juanjo
Juan J. Meléndez
Associate Professor
Department of Physics · University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Phone: +34 924 28 96 55
Fax: +34 924 28 96 51
Email: melendez@unex.es
Web: http://materiales.unex.es/miembros/pers ... Index.html
Re: Memory allocation error in parallel
Dear Juanjo,
Although the error message points to a memory issue, my feeling is that it is not one.
Can you try to run with, e.g., only 2 processors?
Could you also post the input file of the crashed run? I will have a look; if I cannot spot the problem, we will need to reproduce the error,
so I would next also ask for your scf/nscf input files and pseudopotentials.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Hi Daniele
Thank you for replying so quickly.
The test with 2 processors also failed. I am attaching the input file for the crashed run, the reports for 2 processors, the input files for the scf and bands calculations, and the pseudopotential files.
The scf calculation converges with a 4x4x4 k-point mesh, and I simply took the same mesh for the nscf run. For simplicity, I also used a single k-point and a single band in the yambo input file.
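For context, a minimal sketch of how the settings mentioned in this thread (200 Ry cutoff, 80 atoms, 4x4x4 mesh) would appear in the Quantum Espresso nscf/bands input. The prefix and nbnd values are placeholders, and the structure-specific variables and cards (ibrav, ntyp, CELL_PARAMETERS, ATOMIC_SPECIES, ATOMIC_POSITIONS) are omitted since they depend on the system.

Code: Select all
! Sketch only: just the settings discussed in this thread.
&control
   calculation = 'nscf'
   prefix      = 'mysystem'   ! placeholder prefix
/
&system
   nat     = 80        ! 80 atoms in the unit cell
   ecutwfc = 200.0     ! 200 Ry cutoff required by the pseudopotentials
   nbnd    = 400       ! placeholder for the "large" number of bands
/
&electrons
/
K_POINTS automatic
4 4 4 0 0 0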
Best,
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Thanks Juanjo,
it looks like a huge calculation; how long does the ground-state calculation take?
If it is too long, can you try to reduce the sampling and see whether the error persists? That would make debugging easier.
Otherwise, I will try to reproduce your error in the next days.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Hi
The scf+nscf takes about 5 hours with the 4x4x4 mesh. It is indeed relatively large, because the system contains 80 atoms and the pseudopotentials require large cutoffs.
Let me reduce the size of the model and try again. I will get back to you when done.
Thanks again
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Daniele, I reduced the size of the system and the calculation did not crash.
The problem therefore arises when running yambo on realistic scf outputs. In my case, I need something like a 200 Ry cutoff and a 4x4x4 MP mesh for convergence, and these are exactly the settings that produced the crash before. Is there any way to get rid of this error?
Thanks in advance
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Dear Juanjo,
I will try to reproduce your error and let you know. In the meantime, can you send me the l.* output of your serial calculation?
From there I can tell whether it is indeed a memory issue, as it now looks to be.
Best
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Hi Daniele
Now I cannot reproduce the serial calculation either, and I don't get much information:
____ ____ ___ .___ ___. .______ ______
\ \ / / / \ | \/ | | _ \ / __ \
\ \/ / / ^ \ | \ / | | |_) | | | | |
\_ _/ / /_\ \ | |\/| | | _ < | | | |
| | / _____ \ | | | | | |_) | | `--" |
|__| /__/ \__\ |__| |__| |______/ \______/
<---> [01] Files & I/O Directories
<---> [02] CORE Variables Setup
<---> [02.01] Unit cells
<---> [02.02] Symmetries
<---> [02.03] RL shells
<01s> [02.04] K-grid lattice
<01s> [02.05] Energies [ev] & Occupations
<01s> [03] Transferred momenta grid
<01s> [04] Bare local and non-local Exchange-Correlation
<04s> [M 5.599 Gb] Alloc WF ( 5.524)
<06s> [FFT-HF/Rho] Mesh size: 77 77 77
<06s> [WF-HF/Rho loader] Wfs (re)loading | | [000%] --(E) --(X)
<06s> [M 5.754 Gb] Alloc wf_disk ( 0.148)Killed
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es
Re: Memory allocation error in parallel
Dear Juanjo,
OK, it probably is a memory problem. In your serial calculation I can see:

Code: Select all
[M 5.599 Gb] Alloc WF ( 5.524)

so you need more than 5 Gb to allocate the wavefunctions. The serial run can allocate them, but when you run with more processors there is not enough memory on the node.
Maybe you can have a look at the specifications of your machine (RAM per CPU), and possibly reserve an entire node in order to have more memory per CPU, or use more nodes.
Workarounds:
1) Try to run the code with the -M option, i.e.:

Code: Select all
mpirun -n 2 ./yambo -M

or with more than 2 processes; I would not use all the CPUs of the node.
2) You have an enormous number of G-vectors:

Code: Select all
G-vectors [RL space]: 1369167

but most probably you do not need all the G-vectors of the charge density, and the ones describing the wavefunctions are enough:

Code: Select all
WF G-vectors : 182950

So you can either delete your ./SAVE/ndb.gops database and redo the setup, adding to your input file the variable:

Code: Select all
MaxGvecs=182950

or run your HF calculation adding the FFTGvecs variable:

Code: Select all
HF_and_locXC # [R XX] Hartree-Fock Self-energy and Vxc
FFTGvecs=182950
EXXRLvcs= 135 RL # [XX] Exchange RL components
%QPkrange # [GW] QP generalized Kpoint/Band indices
1| 1|115|116|
%
%QPerange # [GW] QP generalized Kpoint/Energy indices
1| 1| 0.0|-1.0|
%

I would first try this last option.
Hope this can solve the problem.
Best,
Daniele
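For completeness, the MaxGvecs workaround would correspond roughly to the following sequence. This is only a sketch built from commands that appear elsewhere in this thread (yambo -i -V RL, yambo -x, mpirun), and it assumes the gops database sits at ./SAVE/ndb.gops.

Code: Select all
# Sketch of workaround 2 (reduced G-vector basis)
rm SAVE/ndb.gops          # remove the old G-vector database
yambo -i -V RL            # regenerate the setup input, exposing MaxGvecs
# edit the generated input and set:  MaxGvecs=182950
yambo                     # redo the setup with the reduced G-vector basis
yambo -x                  # regenerate the Hartree-Fock input
mpirun -np 2 yambo -M     # rerun HF in parallel with the -M option suggested above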
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: Memory allocation error in parallel
Dear Daniele
It works! I used both of your suggestions and got an HF calculation for the system, so it is definitely a memory issue.
Just to make sure that I understand correctly, let me ask a couple more questions:
1) I ran "yambo -i -V RL" and set "MaxGvecs=182950", which is the maximum number of G-vectors used for the wavefunctions in my scf run, and then ran "yambo" for a setup with the limited set of G's (see the sketch below). I guess this limits the amount of memory required to allocate the G-dependent variables. Is this right?
2) Setting "FFTGvecs=182950" in "yambo.in" does the same thing, but does not reduce the amount of memory reserved for the G's, so in this second case my initialization would take longer and memory would be reserved for the whole original set of 1369167 G-vectors. Is this right?
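For reference, a sketch of what the regenerated setup input from point 1 might contain; the exact comments and any additional variables depend on the yambo version, so only the MaxGvecs value is taken from this thread.

Code: Select all
setup                       # initialization runlevel generated by "yambo -i -V RL"
MaxGvecs= 182950       RL   # reduced G-vector basis (wavefunction G-vectors only)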
Thank you very much. You saved my day today!
Juanjo
Juan J. Meléndez
Department of Physics
University of Extremadura
Avda. de Elvas, s/n 06006 Badajoz (Spain)
Email: melendez@unex.es