Variable BSE_RESONANT; NetCDF: HDF error

Deals with issues related to computation of optical spectra in reciprocal space: RPA, TDDFT, local field effects.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan

milesj
Posts: 29
Joined: Thu Jan 26, 2023 9:27 pm

Variable BSE_RESONANT; NetCDF: HDF error

Post by milesj » Thu Mar 09, 2023 9:41 am

Hi all,

I keep receiving the same error:

Code: Select all

Writing File ./conv_k//ndb.BS_Q1_CPU_0; Variable BSE_RESONANT; NetCDF: HDF error
I've attached my input file, as well as my setup and BSE log files. I'm trying to converge a computation with respect to the k-grid, and this error only seems to occur as I go to larger k-grids: the exact same computation with the same input file finished for the 6x6x2, 6x6x6, and 8x8x4 grids, but keeps failing with this error for the 8x8x8 grid. I am using OpenMP parallelization with 64 threads on a single node, setting

Code: Select all

OMP_NUM_THREADS=64
When I check the node during the computation, there don't seem to be any memory issues, and the ndb.BS_Q1_CPU_0 file from the completed 8x8x4 grid computation is only 4 gigabytes, so I can't imagine it's a problem with file size. I'm really at a loss for how to continue at this point, so any advice is appreciated.
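For reference, this is roughly how I launch the run (a sketch; scheduler directives are omitted, and the single MPI task reflects the pure-OpenMP setup described above):

Code: Select all

export OMP_NUM_THREADS=64   # pure OpenMP: one MPI task, 64 threads on the node
srun -n 1 yambo -F 888.in -J conv_k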

Thanks,
Miles
Miles Johnson
California Institute of Technology
PhD candidate in Applied Physics

Davide Sangalli
Posts: 610
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by Davide Sangalli » Thu Mar 09, 2023 10:13 am

Dear Miles,
can you try compiling yambo with parallel I/O? It is now the default, so I guess you compiled with

Code: Select all

--disable-hdf5-par-io
Please don't use this flag; compile instead with

Code: Select all

--enable-hdf5-par-io
It might solve the issue. You would need to re-run the simulation.
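Something along these lines should do it (a sketch; keep the rest of your configure flags as they were):

Code: Select all

make clean_all                                     # remove the old build and internal libs
./configure --enable-open-mp --enable-hdf5-par-io  # ...plus your libxc/hdf5 flags
make yambo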

Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

milesj
Posts: 29
Joined: Thu Jan 26, 2023 9:27 pm

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by milesj » Thu Mar 09, 2023 8:00 pm

Hi Davide,

This is the configure call I used to compile:

Code: Select all

 ./configure  --enable-open-mp --with-libxc-libdir=/opt/apps/libxc-5.2.3_intel-2021.7.1/lib/ --with-libxc-includedir=/opt/apps/libxc-5.2.3_intel-2021.7.1/include/ --with-hdf5-libdir=/opt/apps/hdf5-1.12.2_intel-2021.7.1/lib/ --with-hdf5-includedir=/opt/apps/hdf5-1.12.2_intel-2021.7.1/include/
I've also tried it with just

Code: Select all

 ./configure  --enable-open-mp --with-libxc-libdir=/opt/apps/libxc-5.2.3_intel-2021.7.1/lib/ --with-libxc-includedir=/opt/apps/libxc-5.2.3_intel-2021.7.1/include/
but this gives the same error.

I will try adding --enable-hdf5-par-io and let you know if it helps.

Thanks,
Miles
Miles Johnson
California Institute of Technology
PhD candidate in Applied Physics

Davide Sangalli
Posts: 610
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by Davide Sangalli » Thu Mar 09, 2023 11:47 pm

Can you attach the file:

Code: Select all

config/report

It is generated after you run configure.

Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

milesj
Posts: 29
Joined: Thu Jan 26, 2023 9:27 pm

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by milesj » Fri Mar 10, 2023 8:50 pm

Hi Davide,

Compiling with --enable-hdf5-par-io did not seem to change anything. I've attached the corresponding report file.

I should also mention that, if I instead run with

Code: Select all

OMP_NUM_THREADS=16
and specify -n 4:

Code: Select all

srun -n 4 yambo -F 888.in -J conv_k
(i.e. hybrid parallelization with 4 MPI tasks and 16 OpenMP threads per MPI task, all still on one node), the computation completes the main kernel loop (as you can see in the attached MPI4_OpenMP16_log file), but I get a segmentation fault at the diagonalization step. With the 8x8x8 grid, this same segmentation fault seems to appear whenever I try to use hybrid parallelization (for example, I have tried OMP_NUM_THREADS=64 on two nodes, i.e. with -n 2, and I get the same segmentation fault).

For this reason I've thought of splitting the computation, so that the main kernel loop is calculated with multiple MPI tasks and the diagonalization with just one, but the NetCDF error still appears when I try to run the diagonalization separately. Unfortunately, due to the size of the problem, I can't afford to run without parallelization (8x8x8 alone would take far too long, and I doubt the computation will be close to converged by 8x8x8 anyway), so I need to figure out why I'm getting both of these errors.

Thanks for all the help!
-Miles
Miles Johnson
California Institute of Technology
PhD candidate in Applied Physics

Davide Sangalli
Posts: 610
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by Davide Sangalli » Sun Mar 12, 2023 7:18 pm

Dear Miles,

1) From the attached report, the code is compiled with serial I/O. See these lines:

Code: Select all

# 
# - I/O -
#
# [-]: Parallel I/O  
The reason might be that the external HDF5 library does not have parallel I/O enabled.
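To verify this, you can inspect the configuration summary of the HDF5 build you are linking (a sketch; it assumes the h5cc wrapper of that same installation is in your PATH):

Code: Select all

# look for "Parallel HDF5: yes" in the features section
h5cc -showconfig | grep -i parallel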

You can try again without linking the external HDF5; yambo will then build and use its internal one.
After running configure you should see

Code: Select all

# 
# - I/O -
#
# [X]: Parallel I/O  
And later you should see -D_PAR_IO or similar at the end of the configure output.

2) There is always the option to switch off the I/O of the BSE kernel. Set, in the input file of the BSE run, the variable

Code: Select all

DBsIOoff="BS"
This may also lead to a significant speed-up in the calculation of the BSE matrix.

3) When computing the kernel, yambo distributes the BSE matrix in memory; when diagonalizing it, instead, the matrix is duplicated on each MPI task.
My advice is not to use the diago solver, but rather
(i) the Haydock solver to get the spectra (https://www.yambo-code.eu/wiki/index.ph ... os-Haydock)
(ii) the Slepc solver to get a few eigenvalues and eigenvectors (https://www.yambo-code.eu/wiki/index.ph ... ver:_SLEPC)
The default slepc solver uses distributed memory; there is also a faster algorithm with duplicated memory (in that case, see also the next point).
To use the slepc solver you also need to set

Code: Select all

--enable-slepc-linalg
when running configure.

4) With "parallel I/O enabled" and "BS I/O on" you can first compute the BSE matrix with many cores (BSE matrix distributed in memory).
It will fill the file ndb.BS_PAR. Even if the core crashes during the run, you can re-start it, even changing the number of cores, and the run will resume correctly.
Once the computation of the matrix is finalized, you can restart in serial or with fewer cores ,either with the "non distributed slepc algorithm" or with diago (lapack or sclalpack).
For ScaLAPACK you would need, at configure time,

Code: Select all

--enable-par-linalg
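Concretely, this workflow would look something like the following (a sketch reusing your file and job names; adapt the task counts to your machine):

Code: Select all

# step 1: compute the BSE kernel distributed over many MPI tasks (fills ndb.BS_PAR)
srun -n 64 yambo -F 888.in -J conv_k
# step 2: once the kernel is complete, re-run the solver in serial (or on fewer cores)
srun -n 1 yambo -F 888.in -J conv_k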
Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

milesj
Posts: 29
Joined: Thu Jan 26, 2023 9:27 pm

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by milesj » Sun Mar 12, 2023 8:59 pm

Hi Davide,

1) This time I called

Code: Select all

./configure --enable-open-mp --with-libxc-libdir=/opt/apps/libxc-5.2.3_intel-2021.7.1/lib/ --with-libxc-includedir=/opt/apps/libxc-5.2.3_intel-2021.7.1/include/ --enable-hdf5-par-io --enable-slepc-linalg
but the I/O section still reads

Code: Select all

# - I/O -
#
# [-]: Parallel I/O
# [C]: HDF5
# [C]: NETCDF Large Files Support enabled, Version 4
# [-]: Parallel NETCDF
# [-]: Parallel HDF5
Is it possible I'm missing something necessary for parallel I/O, like CUDA support? (I'm not sure what that is, but configure says I don't have it.)


2&3)

I've rerun the 8x8x8 computation with input file

Code: Select all

DBsIOoff="BS"
NoDiagSC
em1s                             # [R][Xs] Statically Screened Interaction
optics                           # [R] Linear Response optical properties
bss                              # [R] BSE solver
bse                              # [R][BSE] Bethe Salpeter Equation.
dipoles                          # [R] Oscillator strenghts (or dipoles)
FFTGvecs=  18997           RL    # [FFT] Plane-waves
DIP_Threads=0                    # [OPENMP/X] Number of threads for dipoles
X_Threads=0                      # [OPENMP/X] Number of threads for response functions
K_Threads=0                      # [OPENMP/BSK] Number of threads for response functions
Chimod= "HARTREE"                # [X] IP/Hartree/ALDA/LRC/PF/BSfxc
BSEmod= "resonant"               # [BSE] resonant/retarded/coupling
BSKmod= "SEX"                    # [BSE] IP/Hartree/HF/ALDA/SEX/BSfxc
BSSmod= "h"                      # [BSS] (h)aydock/(d)iagonalization/(s)lepc/(i)nversion/(t)ddft
BSENGexx= 18997            RL    # [BSK] Exchange components
BSENGBlk= 217              RL    # [BSK] Screened interaction block size [if -1 uses all the G-vectors of W(q,G,Gp)]
#WehCpl                        # [BSK] eh interaction included also in coupling
BSEprop= "abs"                   # [BSS] Can be any among abs/jdos/kerr/magn/dich/photolum/esrt
BSEdips= "none"                  # [BSS] Can be "trace/none" or "xy/xz/yz" to define off-diagonal rotation plane
% BSEQptR
 1 | 1 |                             # [BSK] Transferred momenta range
%
% BSEBands
  75 |  90 |                         # [BSK] Bands range
%
% BEnRange
  0.00000 | 2.00000 |         eV    # [BSS] Energy range
%
% BDmRange
 0.0100000 | 0.0100000 |         eV    # [BSS] Damping range
%
BEnSteps= 200                    # [BSS] Energy steps
% BLongDir
 0.000000 | 1.000000 | 1.000000 |        # [BSS] [cc] Electric Field
%
BSHayTrs=-0.020000               # [BSS] Relative [o/o] Haydock threshold. Strict(>0)/Average(<0)
% BndsRnXs
   1 | 200 |                         # [Xs] Polarization function bands
%
NGsBlkXs= 217              RL    # [Xs] Response block size
% LongDrXs
 1.000000 | 1.000000 | 1.000000 |        # [Xs] [cc] Electric Field
%
XTermKind= "none"                # [X] X terminator ("none","BG" Bruneval-Gonze)
I will let you know if it finishes.

4) I believe I can't address this point yet, since parallel I/O still seems not to be enabled.


Thanks,
Miles
Miles Johnson
California Institute of Technology
PhD candidate in Applied Physics

Davide Sangalli
Posts: 610
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by Davide Sangalli » Sun Mar 12, 2023 9:49 pm

Can you attach again the full config/report ?

I also suggest using something like

Code: Select all

--with-extlibs-path="/home/milesj/nips3_yambo_converge/YAMBO/yambo-libs"
This way the external libraries are compiled once and for all, and you do not need to recompile them every time.
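Putting together the flags discussed in this thread, the call would look something like this (a sketch; the external HDF5 flags are dropped so the internal library is built, as per point 1 above):

Code: Select all

# ...add your libxc flags as before
./configure --enable-open-mp --enable-hdf5-par-io --enable-slepc-linalg \
            --with-extlibs-path="/home/milesj/nips3_yambo_converge/YAMBO/yambo-libs"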

Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

milesj
Posts: 29
Joined: Thu Jan 26, 2023 9:27 pm

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by milesj » Mon Mar 13, 2023 6:23 pm

Hi Davide,

I've attached the corresponding report file from my previous message.

Also, the run from point 2&3) of my previous message failed in the same way.

Thanks,
Miles
Miles Johnson
California Institute of Technology
PhD candidate in Applied Physics

Davide Sangalli
Posts: 610
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: Variable BSE_RESONANT; NetCDF: HDF error

Post by Davide Sangalli » Tue Mar 14, 2023 9:47 am

We should figure out why it is always compiling with serial I/O.
Can you also attach the config.log file?
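In the meantime, you can also grep it yourself for hints on why the parallel HDF5 test fails (a sketch):

Code: Select all

grep -i -n "parallel" config.log | head -n 40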

Then, with the I/O of the BS database switched off, I guess the error you get is again the segmentation fault(?).
Did you try using the Haydock solver?

Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/
