NETCDF problem in Yambo 3.2.3 r.696 with fragmented database

vormar
Posts: 10
Joined: Wed Jan 27, 2010 4:40 pm

NETCDF problem in Yambo 3.2.3 r.696 with fragmented database

Post by vormar » Mon Nov 15, 2010 9:35 am

I downloaded the most recent Yambo version and ran into the following interesting behavior.
I always compile Yambo on the CINECA SP6 machine with essentially the same makefile that was used there to
compile "yambo/3.2.2". Here is my compilation script:

Code: Select all

module purge
module load xl/10.1
module load netcdf/4.0.1--xl--10.1 fftw/3.2.2--xl--10.1 blacs/1.1--xl--10.1 lapack/3.2.1--xl--10.1 scalapack/1.8.0--xl--10.1

export CPP=cpp
export CC=xlc_r
export F77=xlf_r
export FC=xlf90_r
export FCFLAGS='-O2 -q64 -qstrict -qarch=pwr6 -qtune=pwr6 -qmaxmem=-1 -qsuffix=f=f'
export ESPRESSO_IOTK=/cineca/prod/build/applications/QuantumESPRESSO/4.1/xl--10.1/BA_WORK/QuantumESPRESSO-4.1/iotk

./configure  --with-fftw=${FFTW_LIB} --with-netcdf-lib=${NETCDF_LIB} --with-netcdf-include=${NETCDF_INC} --with-blacs=${BLACS_LIB} --with-scalapack=${SCALAPACK_LIB} --with-p2y=4.0 --with-iotk=${ESPRESSO_IOTK} --build=powerpc-ibm

gmake yambo
gmake interfaces
gmake ypp
After configuring, the code reports that NETCDF support is correctly switched on ("[X ] NETCDF/Large Files") and my Makefile contains the line "netcdf = -D_NETCDF_IO". Running "p2y -g" generates the following files:

Code: Select all

ns.wf
ns.wf_fragments_1_1
ns.db1
It is interesting that only one fragment appears, since the output of "p2y -g" contains the following:

Code: Select all

 <---> == Writing DB1 ...
 <---> == DB2 (wavefunctions) ...
 <---> :: WF splitter Blocks/bands/block size(Mb):    2   500   227
and also according to "yambo -D" there should be two blocks:

Code: Select all

[RD./SAVE//ns.db1]------------------------------------------
 Bands                           : 1000
 K-points                        : 1
 G-vectors             [RL space]:  475489
 Components       [wavefunctions]:  59589
 Symmetries       [spatial+T-rev]:  48
 Spinor components               : 1
 Spin polarizations              : 1
 Temperature                 [ev]: 0.000000
 Electrons                       : 56.00000
 WF G-vectors                    :  59589
 Max atoms/species               : 16
 No. of atom species             : 2
- S/N 006556 ---------------------------- v.03.02.03 r.696 -
[RD./SAVE//ns.wf]-------------------------------------------
 Bands in each block             :  500
 Blocks                          : 2
- S/N 006556 ---------------------------- v.03.02.03 r.696 -
[RD./SAVE//ndb.kindx]---------------------------------------
 Polarization last K   : 1
 QP states             : 1  1
 X grid is uniform     :yes
 BS scattering         :no
- S/N 006556 ---------------------------- v.03.02.03 r.696 -
[RD./SAVE//ndb.cutoff]--------------------------------------
 Brillouin Zone Q/K grids (IBZ/BZ): 1  1  1  1
 CutOff Geometry                 :sphere xyz
 Box sides length            [au]: 0.000     0.000     0.000
 Sphere/Cylinder radius      [au]: 16.00000
 Cylinder length             [au]: 0.000000
 RL components                   :  59589
 RL components used in the sum   : 119177
 RIM corrections included        :no
 RIM RL components               :0
 RIM random points               :0
- S/N 006556 ---------------------------- v.03.02.03 r.696 -
When I try to run, e.g., a GW plasmon-pole calculation I get the following error (a similar error appears if I try
to run a BSE calculation):

Code: Select all

[ERROR] STOP signal received while in :[05] Bare local and non-local Exchange-Correlation
[ERROR][NetCDF] NetCDF: Variable not found
Do you have any clue what might cause this? Am I missing something? Just in case this info is useful: the size of the "ns.wf_fragments_1_1" file is around 120 MB, so I assume there is no large-file problem.

If needed, I can try to prepare inputs and databases with a reduced number of bands and G-vectors.
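
For completeness, this is how I double-checked that the netCDF preprocessor flag actually made it into the build (just grepping for the flag mentioned above; a rough check, file locations may differ):

Code: Select all

# quick sanity check of the build configuration
grep -i netcdf Makefile       # should contain: netcdf = -D_NETCDF_IO
grep -i netcdf config.log     # configure log, if present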

Thanks,
Marton
Márton Vörös
PhD student
Department of Atomic Physics,
Budapest University of Technology and Economics
Budafoki út 8., H-1111, Budapest, Hungary
http://www.fat.bme.hu/MartonVoros

andrea marini
Posts: 325
Joined: Mon Mar 16, 2009 4:27 pm

Re: NETCDF problem in Yambo 3.2.3 r.696 with fragmented database

Post by andrea marini » Mon Nov 15, 2010 10:03 am

Dear Marton, I cannot reproduce the error (of course, otherwise everything would be much easier :cry: ). The first thing to note is that NETCDF and fragmented databases are not correlated: you can fragment even without NETCDF support.

Now, the first thing to do is to check whether you have all the variables. If you apply "ncdump -h <DATABASE>" to your databases you should see something like:

Code: Select all

SAVE>ncdump -h ns.wf
netcdf ns {
dimensions:
	D_0000000001 = 1 ;
	D_0000000003 = 3 ;
	D_0000000002 = 2 ;
variables:
	float FRAGMENTED(D_0000000001) ;
	float HEAD_VERSION(D_0000000003) ;
	float HEAD_REVISION(D_0000000001) ;
	float SERIAL_NUMBER(D_0000000001) ;
	float SPIN_VARS(D_0000000002) ;
	float BAND_GROUPS(D_0000000002) ;
}
SAVE>ncdump -h ns.wf_fragments_1_1
netcdf ns {
dimensions:
	D_0000000050 = 50 ;
	D_0000000745 = 745 ;
	D_0000000001 = 1 ;
variables:
	float WF_REAL_COMPONENTS_@_K1_BAND_GRP_1(D_0000000001, D_0000000745, D_0000000050) ;
	float WF_IM_COMPONENTS_@_K1_BAND_GRP_1(D_0000000001, D_0000000745, D_0000000050) ;
}
Can you check that you have the same structure and that the variables' data fields are not empty?
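
To check that the data were actually written and not just declared, you can also dump a specific variable, e.g. (variable names taken from the headers above; adapt them to your case):

Code: Select all

ncdump -v FRAGMENTED ns.wf
ncdump -v WF_IM_COMPONENTS_@_K1_BAND_GRP_1 ns.wf_fragments_1_1 | tail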

THX!

Andrea
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)

vormar
Posts: 10
Joined: Wed Jan 27, 2010 4:40 pm

Re: NETCDF problem in Yambo 3.2.3 r.696 with fragmented database

Post by vormar » Tue Nov 16, 2010 12:45 am

Dear Andrea,

Thanks for your quick answer. I played around with the code a bit and it seems I forgot to report some key details.
To see the effect of linking against the netCDF library, I compiled the code without netCDF support and the problem disappeared.
I then recalculated everything again with the netCDF-enabled build.

The key issue turned out to be that I ran these calculations interactively and
simply did not pay attention to the error messages after the run. Here is a summary of what I tried:

p2y with netCDF:

Code: Select all

-bash-3.2$ p2y -N 
llsubmit: Processed command file through Submit Filter: "/cineca/usr/loadl/deisa/exits/pre_filter.pl".
ERROR: 0031-161  EOF on socket connection with node sp0202
Result: only the first fragment is generated, and that one is also missing some components.
According to IBM, this error message means: "Processing continues. The socket used
to connect the Home Node with the indicated remote task has closed. Probably the
remote node has closed the connection."

Here is the output of ncdump applied to these databases:

Code: Select all

-bash-3.2$ ncdump -h ns.wf
netcdf ns {
dimensions:
        D_0000000001 = 1 ;
        D_0000000003 = 3 ;
        D_0000000002 = 2 ;
variables:
        float FRAGMENTED(D_0000000001) ;
        float HEAD_VERSION(D_0000000003) ;
        float HEAD_REVISION(D_0000000001) ;
        float SERIAL_NUMBER(D_0000000001) ;
        float SPIN_VARS(D_0000000002) ;
        float BAND_GROUPS(D_0000000002) ;
}
-bash-3.2$ ncdump -h ns.wf_fragments_1_1
netcdf ns {
dimensions:
        D_0000000500 = 500 ;
        D_0000059589 = 59589 ;
        D_0000000001 = 1 ;
variables:
        float WF_REAL_COMPONENTS_@_K1_BAND_GRP_1(D_0000000001, D_0000059589, D_0000000500) ;
}
p2y w/o netCDF:

Code: Select all

-bash-3.2$ p2y -N
llsubmit: Processed command file through Submit Filter: "/cineca/usr/loadl/deisa/exits/pre_filter.pl".
ERROR: 0031-250  task 0: Illegal instruction
Result: the database seems to be fine, though there is a "core.txt" file in the input directory suggesting that
there was an error during the run.

If I submitted these jobs without "-N" to the queue system on a single node, then
the error files disappeared and the SAVE directory always contained the correct fragmented database.

Interestingly, when I used the "-g" option I always got a "core.txt" file (even when I submitted the
job to the queue), though the generated databases seem to be fine in the case of the submitted job.

Finally, I have a question: why is fragmentation switched on automatically if the database
exceeds a given size, and what is the advantage of having a fragmented database?

Thanks,
Marton
Márton Vörös
PhD student
Department of Atomic Physics,
Budapest University of Technology and Economics
Budafoki út 8., H-1111, Budapest, Hungary
http://www.fat.bme.hu/MartonVoros

Conor Hogan
Posts: 111
Joined: Tue Mar 17, 2009 12:17 pm

Re: NETCDF problem in Yambo 3.2.3 r.696 with fragmented database

Post by Conor Hogan » Tue Nov 23, 2010 5:07 pm

vormar wrote:
Finally, I have a question: why is fragmentation switched on automatically if the database exceeds a given size, and what is the advantage of having a fragmented database?

If I remember correctly, there is some forced fragmentation into blocks (of bands) simply to keep the size of the wavefunction arrays held in memory at a manageable level. It is presently hardcoded into the interfaces, at interfaces/int_modules/mod_wf2y.F:

Code: Select all

  real(SP),parameter :: max_wf_block_size=400. ! MB
Without this, we found that the interfaces would often exceed the usual virtual memory available on clusters with 2 GB/core: due to the way netCDF is called, about 800 MB would be needed when writing each block, on top of other temporary arrays. This could certainly be improved!
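
As a back-of-the-envelope illustration of how the splitter arrives at the numbers in Marton's log ("WF splitter Blocks/bands/block size(Mb): 2 500 227"), assuming 8 bytes per complex single-precision wavefunction component:

Code: Select all

# one block of 500 bands: 59589 components x 500 bands x 8 bytes -> ~227 MB (below the 400 MB limit)
echo "59589 * 500 * 8 / 1024^2" | bc -l
# a single block with all 1000 bands would be ~455 MB > 400 MB, hence the split into 2 blocks
echo "59589 * 1000 * 8 / 1024^2" | bc -l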

Fragmentation over k-points, on the other hand, is switched on with -S. This is simply to allow direct access to particular wavefunctions without having to read all previous k-points (the databases are not "direct access" Fortran files). I should mention that, since Yambo supports databases in both netCDF and native binary Fortran format, a lot of things that might seem like obvious performance improvements are not so obvious in practice!
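
For example, in a k-point-fragmented SAVE you would get one file per k-point and band block, following the naming pattern of the ns.wf_fragments_1_1 file above (hypothetical listing, pattern ns.wf_fragments_<k-point>_<band block> inferred from that example):

Code: Select all

# hypothetical SAVE layout for 2 k-points and 2 band blocks
ls SAVE/ns.wf_fragments_*
#  ns.wf_fragments_1_1  ns.wf_fragments_1_2
#  ns.wf_fragments_2_1  ns.wf_fragments_2_2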

We are currently working on improving how memory is distributed among processors in various parts of the code.

Conor
Dr. Conor Hogan
CNR-ISM, via Fosso del Cavaliere, 00133 Roma, Italy;
Department of Physics and European Theoretical Spectroscopy Facility (ETSF),
University of Rome "Tor Vergata".
