Yambo Community Forum

Posted: **Tue Jul 28, 2015 3:07 pm**

Dear Yambo Developers,

I am having problems running Yambo (v4.0.1 r.88) with the NetCDF libraries on a Cray machine.
I have successfully configured and compiled the code, linking against the Cray-optimized NetCDF and HDF5 libraries. In particular, after running the configure script with the machine-specific flags, I get the following:

#
# [VER] 4.0.1 r.88
#
# - GENERAL CONFIGURATIONS -
#
# [SYS] linux@x86_64
# [SRC] /home/e411/e411/fcaruso/espresso-5.1.2/yambo-4.0.1-epl
# [BIN] /home/e411/e411/fcaruso/espresso-5.1.2/yambo-4.0.1-epl/bin
# [-] Double precision
# [X] Redundant compilation 
# [-] Run-Time timing profile
#
# - PARALLEL SUPPORT -
#
# [X] MPI (open-mpi kind)
# [-] OpenMP
# [-] Blue-Gene specific procedures
#
# - LIBRARIES (E=external library; I=internal library; -=not used;) -
#
#  I/O
# [ E ] IOTK   : /home/e411/e411/fcaruso/espresso-5.1.2/iotk//src/libiotk.a (QE 5.0)
# [ - ] ETSF_IO: 
# [ E ] NETCDF : -L/opt/cray/netcdf-hdf5parallel/4.3.2/INTEL/140//lib -lnetcdff -lnetcdf (No large files support)
# [ E ] HDF5   : -L/opt/cray/hdf5-parallel/1.8.13/INTEL/140//lib -L/opt/cray/hdf5-parallel/1.8.13/INTEL/140//lib -lhdf5_fortran -lhdf5_hl -lhdf5 (No specific HDF5-IO support)
#
#  MATH
# [ E ] FFT      : -L/opt/cray/fftw/3.3.4.2/sandybridge//lib -lfftw3 (FFTW v3)
# [ E ] BLAS     : -L/opt/intel/composer_xe_2013_sp1.4.211/mkl/lib/intel64/ -lmkl_intel_lp64  -lmkl_sequential -lmkl_core 
# [ E ] LAPACK   : -L/opt/intel/composer_xe_2013_sp1.4.211/mkl/lib/intel64/ -Wl,--start-group -lmkl_intel_lp64  -lmkl_sequential -lmkl_core -Wl,--end-group -ldl
# [ E ] SCALAPACK: -L/opt/intel/composer_xe_2013_sp1.4.211/mkl/lib/intel64/ -lmkl_scalapack_lp64
#
#  OTHER
# [ I ] LibXC      : -lxc
# [ - ] MPI library: 
#
# - COMPILERS, MAKE and EDITOR -
#
# [ CPP ] cc -E -ansi -D_NETCDF_IO -D_MPI -D_FFTW -D_FFTW_OMP -D_SCALAPACK      -D_OPENMPI 
# [  C  ] cc -g -O2  -D_C_US -D_FORTRAN_US
# [MPICC] cc -g -O2  -D_C_US -D_FORTRAN_US
# [ F90 ] ftn -assume bscc -O2 -static -ip -nofor_main  
# [MPIF ] ftn -assume bscc -O2 -static -ip -nofor_main  
# [ F77 ] ftn -assume bscc -O2 -static -ip -nofor_main
# [Cmain] -Mnomain
# [NoOpt]  -assume bscc -O0 -static -nofor_main
#
# [ MAKE ] make
# [EDITOR] vim
#

"make yambo interfaces" generates the executables without problems.

However, when I run yambo to calculate the PPA quasi-particle corrections for silicon, I systematically run into the following error:

P0002: [ERROR] STOP signal received while in :[07] Dynamic Dielectric Matrix (PPA)
P0002: [ERROR][NetCDF] NetCDF: Unknown file format

This occurs as soon as yambo finishes computing the first q point of the polarizability.
Additionally, the calculation does not exit after the error: it keeps occupying the nodes of the cluster until I delete the job manually.
I attach the input/output files of the calculation.

Any idea of the origin of this problem?

Thanks in advance for your help!

Best,
Fabio
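
Since the post above reports that the job keeps holding its nodes after the error, one option is to scan the logs for error lines and cancel the job by hand or from a wrapper script. The following is a minimal sketch in Python, assuming that error lines follow the `P0002: [ERROR]...` pattern quoted above; the function name `find_errors` is hypothetical, and reading the actual log files is left to the caller.

```python
import re

# Pattern modelled on the two error lines quoted in the post above.
# Whether every Yambo version formats errors this way is an assumption.
ERROR_RE = re.compile(r"^(P\d+): \[ERROR\](.*)$")


def find_errors(lines):
    """Return (task, message) pairs for every line that looks like an error."""
    hits = []
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits


# The two lines reported in the post, used here as sample input.
sample = [
    "P0002: [ERROR] STOP signal received while in :[07] Dynamic Dielectric Matrix (PPA)",
    "P0002: [ERROR][NetCDF] NetCDF: Unknown file format",
]

errors = find_errors(sample)
for task, message in errors:
    print(task, message)
```

In practice the lines would come from the log files of the run (for example via `open(path).read().splitlines()`), and a non-empty result would be the signal to cancel the job through the scheduler.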

Posted: **Tue Jul 28, 2015 4:02 pm**

Dear Fabio,
I do not know whether this is related to the NetCDF library you linked.
Can you check whether the problem persists with a different parallelization strategy (i.e. not putting all the CPUs on the q's or, better, not parallelizing over the q's at all)?

Best,
Daniele

Posted: **Tue Jul 28, 2015 4:36 pm**

Hi Fabio,

actually this looks very much like an issue we are investigating at the moment
(which is triggered by the q-parallelism in the calculation of X).

For the time being, as Daniele suggested, I would just avoid the q parallelism.
We'll let you know when the problem is fixed.

ciao ciao
Andrea

Posted: **Tue Jul 28, 2015 5:03 pm**

Dear Daniele and Andrea,

thanks for your reply! The problem disappeared when I turned off the q-parallelization, as you said.
(However, the problem with the q-parallelization does not seem to depend on the libraries I have linked. I reconfigured and recompiled against a different version of the NetCDF and HDF5 libraries available on the Cray, and the issue persists.)

Thanks a lot for your help (and for the amazing work on Yambo)!

Best,
Fabio

Posted: **Tue Jul 28, 2015 5:07 pm**

Hi Fabio,

thanks for reporting.
Indeed, the problem seems to be related to the MPI communicators or the like. In our experience it only surfaces during I/O, and in a non-reproducible way (something like two tasks trying to write to the same file at the same moment).

take care
Andrea

Posted: **Mon Aug 17, 2015 12:49 am**

Hello,

I am experiencing the same problem in a very unpredictable way. Unfortunately, it is happening for well over half of my runs. I read the thread here and I do not think I am parallelizing over q points. I am following the descriptions in the tutorials for the cvb file example (which runs fine) in the YAMBO_TUTORIAL directory.

For 16 processors, I am using the following

X_all_q_ROLEs= "q k c v" # [PARALLEL] CPUs roles (q,k,c,v)
X_all_q_CPU= "1 1 4 4" # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b" # [PARALLEL] CPUs roles (q,qp,b)
SE_CPU= "1 1 16" # [PARALLEL] CPUs for each role
X_all_q_nCPU_invert=0 # [PARALLEL] CPUs for matrix inversion

Keeping that configuration, but changing various convergence parameters (EXXRLvcs, for example), I do not have a very high success rate for run completion, let alone actual meaningful convergence.

I don't really have a question here so much as contributing my observations to the discussion.

Version: Version 4.0.1 Revision 88
Run parameters: have run with both -S and without -S
System: MoS2

Regards
Jeff Mullen
NCSU
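
The parallel settings quoted in the post above can be checked before submitting a run. The following is a minimal sketch in Python, assuming (as the quoted values suggest for 16 processors) that the product of the per-role CPU counts should equal the number of MPI tasks; the helper names `parse_layout` and `check_layout` are hypothetical and not part of Yambo.

```python
from math import prod


def parse_layout(roles, cpus):
    """Pair each role name with its CPU count, e.g. "q k c v" with "1 1 4 4"."""
    role_names = roles.split()
    cpu_counts = [int(value) for value in cpus.split()]
    if len(role_names) != len(cpu_counts):
        raise ValueError("roles and CPU counts differ in length")
    return dict(zip(role_names, cpu_counts))


def check_layout(roles, cpus, n_tasks):
    """Report whether the layout fills n_tasks and whether q is parallelized."""
    layout = parse_layout(roles, cpus)
    return {
        "layout": layout,
        # Assumption: the product of the counts should match the MPI task count.
        "matches_tasks": prod(layout.values()) == n_tasks,
        # More than one CPU on the q role means q-parallelism is active.
        "q_parallel": layout.get("q", 1) > 1,
    }


# Values copied from the post above (16 processors).
x_check = check_layout("q k c v", "1 1 4 4", 16)
se_check = check_layout("q qp b", "1 1 16", 16)
print(x_check)
print(se_check)
```

For the values in the post, both layouts multiply out to 16 and neither puts more than one CPU on the q role, which is consistent with the statement that the run is not parallelized over q points.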

Posted: **Mon Aug 17, 2015 7:13 am**

Dear Jeff
Thank you very much for reporting. We will investigate this thoroughly very soon.
Best,
Daniele

Posted: **Mon Aug 17, 2015 10:07 am**

Dear Jeff,

thanks for reporting about this issue.
Could you please add some more details to help us figure out what is going on?
In particular: some relevant log files, the input and report files, config.log, and any other info
you deem relevant.

thank you
Andrea

Posted: **Mon Aug 17, 2015 11:45 am**

Dear all:

I had the same problem whether or not I used q parallelization, and I worked around it simply by adding the line DBsIOoff= "DIP" to the input file.
Although the oscillator strengths are then not stored in the databases, I think this is a reasonable workaround for the moment.

Sincerely,

Posted: **Mon Aug 17, 2015 1:46 pm**

Hello,

I am attaching the LOG directory, the sequence of commands I run (cmds.sh) and the input file I am using for testing. I am sure you know this, but this input does not describe a converged system: I created a run with very few k points to experiment with the NetCDF problem. This is one input file variation of about 50 I have tried over the last week, covering a large permutation of parallelization variables. I am not parallelizing, and have not tried to parallelize, over q, qp, etc., only over the c, v, and b roles, as the tutorial suggests this is the combination I want in order to lower the memory per processor.

I do not have the config.log as I did not compile the code, our HPC admins did. I can try to get the config.log if this doesn't help.

And finally, I did not post the SAVE directory due to its size (2.6G). If anyone requires that, and this application will allow me to upload it, I will.

Thanks,
Jeff Mullen
NCSU

NetCDF: Unknown file format