"[ERROR] LINEAR ALGEBRA driver" or NaN results in GW calculation

Post by **mrefiore** » Thu Jul 11, 2024 4:51 pm

Dear YAMBO developers and users,

I'm trying to run a G0W0 calculation with YAMBO 5.2.2 and/or 5.2.3, specifically installed on the LEONARDO supercomputer @CINECA.
When the nscf is run with few empty bands, say 500, the G0W0 calculation is successful without any problem and the output is reasonable.

When the nscf is run with more bands, e.g. 3000, and I then run a G0W0 calculation with a number of bands <= 3000, I get this error message while yambo is computing chi:

[ERROR] LINEAR ALGEBRA driver [PARALLEL_lin_system]performing P(Z/C)GESV

This happens when in the input file I leave the default

X_and_IO_nCPU_LinAlg_INV=-1

If I change the value of this keyword and run the linear algebra on the GPU, the code proceeds and finishes. However, in this case I get NaN results in the output file:

# K-point Band Eo [eV] E-Eo [eV] Sc|Eo [eV]
#
1 93 0.000000 NaN NaN
1 94 1.681785 NaN NaN
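
A quick way to check a whole QP output file for this symptom is to scan its data rows. The following is only a minimal sketch and not part of Yambo: it assumes the column layout shown above (K-point, Band, then floating-point columns) and that header lines start with `#`. The function name `find_nan_rows` is hypothetical.

```python
import math


def find_nan_rows(lines):
    """Return (k-point, band) pairs whose numeric columns contain NaN or Inf.

    Assumes the layout shown above: K-point, Band, then floating-point columns.
    """
    bad = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = stripped.split()
        try:
            # int(float(...)) also accepts indices written as "1.000"
            kpt, band = int(float(fields[0])), int(float(fields[1]))
            values = [float(x) for x in fields[2:]]
        except (ValueError, IndexError):
            continue  # not a data row
        if any(not math.isfinite(v) for v in values):
            bad.append((kpt, band))
    return bad


# The two rows quoted above, used as sample input
sample = [
    "# K-point Band Eo [eV] E-Eo [eV] Sc|Eo [eV]",
    "#",
    "1 93 0.000000 NaN NaN",
    "1 94 1.681785 NaN NaN",
]
print(find_nan_rows(sample))
```

For a real run, the lines could be read from the output file with `open(path).readlines()`.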

I suspect the two issues are related. I've tried with different yambo versions (5.2.2 and the newer 5.2.3) without luck.
I've attached the input, output and log files.
Thank you very much for your help!
Best,

Michele

Post by **mrefiore** » Thu Jul 11, 2024 6:45 pm

Dear all,

I can add more details to the problem.
By inspecting the ns.wf_fragment_* databases in the SAVE directory, it appears that some fragments contain -Infinity values. This happens ONLY when I run a Quantum ESPRESSO nscf calculation with a larger number of bands (in my case, 3000). When the nscf is performed with fewer bands (500 in my case), all ns.wf_fragment_* are fine and indeed in this case all YAMBO calculations are ok.
Keeping in mind that I'm running on the GPU-accelerated LEONARDO cluster, I've tried to produce the SAVE with different compilations of p2y, including a non-GPU one, but the problem is always there. I've also tried p2y -b #bands, as suggested on the YAMBO GitHub page, without success. In contrast, I've always used GPU-accelerated QE versions.
Now, I'm wondering if this is a p2y problem or rather a QE issue when a "larger" number of bands is considered.
Thank you for your help!

Michele
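
The inspection of the ns.wf_fragment_* databases described in this post could be scripted roughly as follows. This is a sketch under stated assumptions, not a Yambo tool: it assumes the fragments are netCDF files readable with the third-party `netCDF4` Python package, that they live in `SAVE/`, and that the data sit in floating-point variables. The glob pattern and the function names are hypothetical.

```python
import glob

import numpy as np


def count_nonfinite(array):
    """Count NaN and +/-Inf entries in a real or complex array."""
    data = np.ma.getdata(array)  # also look at entries a masked array hides
    return int(np.count_nonzero(~np.isfinite(data)))


def scan_fragments(pattern="SAVE/ns.wf_fragment*"):
    """Map (file, variable) -> number of non-finite values, for affected variables only.

    The default pattern is an assumption; adjust it to your SAVE directory.
    """
    from netCDF4 import Dataset  # third-party package, imported only when needed

    report = {}
    for path in sorted(glob.glob(pattern)):
        with Dataset(path, "r") as ds:
            for name, var in ds.variables.items():
                if not np.issubdtype(var.dtype, np.floating):
                    continue  # skip integer and character variables
                n_bad = count_nonfinite(var[:])
                if n_bad:
                    report[(path, name)] = n_bad
    return report


if __name__ == "__main__":
    for (path, name), n_bad in scan_fragments().items():
        print(f"{path}: {name} has {n_bad} non-finite values")
```

Reading each variable in full keeps the sketch short; for very large fragments one would probably read slices instead.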

Post by **Daniele Varsano** » Mon Jul 15, 2024 9:29 am

Dear Michele,

you are not the first reporting this issue.
Can you check whether the NaN (Infinity) values are already present in the raw wavefunctions generated by QE, or only in the ns.wf* files?
If they are only present in the ns.wf* files, we can investigate what is going wrong in p2y when it converts the format.

Best,
Daniele

Post by **mrefiore** » Mon Jul 15, 2024 10:24 am

Dear Daniele,

Thank you for your reply!
I did want to investigate that, but unfortunately I don't know how to read QE's binary wfc#.dat files.
However, I can add that if I run the QE calculation with a NON-GPU-accelerated version, the problem vanishes. Therefore, I strongly suspect the problem lies in QE-GPU.

Michele

Post by **csk** » Fri Oct 04, 2024 12:51 pm

Hi!

I can confirm the error with the GPU build of QE (v7.2, also on the Leonardo cluster): when running with more bands, the wavefunctions contain NaN values. The error goes away when running with a CPU-only version of QE.

To check whether your QE wavefunctions contain NaN values, you might find the following code snippet useful:

Code: Select all

import numpy as np

def read_wavefunction_k_qe_dat(dat_file):
    # Credits: https://mattermodeling.stackexchange.com/questions/9149/how-to-read-qes-wfc-dat-files-with-python

    with open(dat_file, 'rb') as f:
        # Moves the cursor 4 bytes to the right
        f.seek(4)

        ik = np.fromfile(f, dtype='int32', count=1)[0]
        xk = np.fromfile(f, dtype='float64', count=3)
        ispin = np.fromfile(f, dtype='int32', count=1)[0]
        gamma_only = bool(np.fromfile(f, dtype='int32', count=1)[0])
        scalef = np.fromfile(f, dtype='float64', count=1)[0]

        # Move the cursor 8 byte to the right
        f.seek(8, 1)

        ngw = np.fromfile(f, dtype='int32', count=1)[0]
        igwx = np.fromfile(f, dtype='int32', count=1)[0]
        npol = np.fromfile(f, dtype='int32', count=1)[0]
        nbnd = np.fromfile(f, dtype='int32', count=1)[0]

        # Move the cursor 8 byte to the right
        f.seek(8, 1)

        b1 = np.fromfile(f, dtype='float64', count=3)
        b2 = np.fromfile(f, dtype='float64', count=3)
        b3 = np.fromfile(f, dtype='float64', count=3)

        f.seek(8,1)

        mill = np.fromfile(f, dtype='int32', count=3*igwx)
        mill = mill.reshape( (igwx, 3) )

        evc = np.zeros( (nbnd, npol*igwx), dtype="complex128")

        f.seek(8,1)
        for i in range(nbnd):
            evc[i,:] = np.fromfile(f, dtype='complex128', count=npol*igwx)
            f.seek(8, 1)

    return evc

kpoint = 1
wf = read_wavefunction_k_qe_dat('wfc' + str(kpoint) + '.dat')
# np.isfinite is False for NaN and for +/-Inf, so this catches both
print('WF contains NaN or Inf values: ', (~np.isfinite(wf)).any())
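
To see which bands are affected, and to catch the -Infinity values reported earlier in the thread as well as NaN, the array returned by the reader above can be tested band by band. This is a minimal sketch; the helper name `nonfinite_bands` is hypothetical, and the demo uses a small synthetic array in place of a real `evc`.

```python
import numpy as np


def nonfinite_bands(evc):
    """Return 1-based indices of bands whose coefficients include NaN or +/-Inf."""
    # One row per band: a band is bad if any of its coefficients is not finite
    bad = ~np.isfinite(evc).all(axis=1)
    return (np.flatnonzero(bad) + 1).tolist()


# Synthetic stand-in for `evc` (bands x plane-wave coefficients)
demo = np.ones((4, 3), dtype="complex128")
demo[1, 0] = np.nan
demo[3, 2] = complex(0.0, -np.inf)
print(nonfinite_bands(demo))
```

With real data, one would call `nonfinite_bands(read_wavefunction_k_qe_dat('wfc1.dat'))` for each k-point file.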

Do you know if this has been reported to the Quantum Espresso developers or would you assume that this is a problem of the compilation?

Cheers,
Christian

Post by **Daniele Varsano** » Fri Oct 04, 2024 4:48 pm

Dear Christian,

this problem has been reported to the QE community, and I know that it has been (at least partially) fixed by the QE developers. As far as I know, you need to avoid the -npools option in the QE run, but I do not know whether a specific patched version of QE is also required. You may want to ask the Leonardo user support or the QE mailing list.

Best,

Daniele

Post by **Aolei Wang** » Mon Dec 15, 2025 3:36 am

Dear all,

I am testing k-point convergence for a GW calculation (Yambo v5.3.0) and I encountered the same error:
[ERROR] STOP signal received while in[07] Dynamic Dielectric Matrix (PPA)
[ERROR] LINEAR ALGEBRA driver [PARALLEL_lin_system] performing P(Z/C)GESVXXX

The error does not appear for the coarser k-grids 6×6×1, 9×9×1, 12×12×1, and 15×15×1, but it does appear starting at 18×18×1. If I set X_and_IO_nCPU_LinAlg_INV=1, the run proceeds without that error, but then all GW energy corrections become NaN.
I also checked the QE wavefunctions (wfc*) using the script provided by Christian, as well as the ns.wf_fragments_* files produced by p2y after conversion; none of these files contain NaN values.

Could you help me diagnose what might be causing (a) the GESVXXX linear algebra error at higher k-point density, and (b) the NaN GW corrections when forcing X_and_IO_nCPU_LinAlg_INV=1? Any suggestions would be greatly appreciated.

I have attached the relevant input, output files for reference. Thank you very much!

Best regards,
Aolei Wang
Department of Physics & Astronomy
California State University, Northridge

Post by **Daniele Varsano** » Mon Dec 15, 2025 8:49 am

Dear Aolei,

actually, it is not straightforward to spot the problem.
Regarding the second run (serial linear algebra), I notice from the report that yambo is reading some previously calculated plasmon-pole databases. You can try to remove all the ./bse/ndb.pp* databases and rerun your calculation. Hopefully this will solve the problem.
About the failure when using parallel linear algebra, at the moment I do not have a clue.
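
The removal of the ndb.pp* databases mentioned above can be done by hand; the sketch below just makes it explicit and lists the files first. It is not a Yambo utility: the function name `remove_databases` and the `dry_run` switch are hypothetical, and the `./bse` directory is taken from the path quoted in this post.

```python
from pathlib import Path


def remove_databases(directory, pattern="ndb.pp*", dry_run=True):
    """Return files matching `pattern` in `directory`; delete them unless dry_run."""
    matches = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # dry_run=True only lists the files; set it to False to actually delete them
    for path in remove_databases("./bse", dry_run=True):
        print("would remove:", path)
```

Listing the matches before deleting makes it easy to confirm that only the plasmon-pole databases are touched and other ndb.* files are kept.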

Best,
Daniele

PS: Note that setting FFTGvecs to a low value (10 Ry) could be a source of problems.

Post by **Aolei Wang** » Wed Dec 17, 2025 2:13 am

Dear Daniele,

Thank you very much for the quick reply. I agree it's not straightforward. Following your suggestions, I made the following changes and re-ran the job: I increased FFTGvecs, EXXRLvcs, and VXCRLvcs from 10 Ry to 30 Ry and removed all ./bse/ndb.pp* databases. After this the run completes, but the GW energy corrections are still NaN. I attach the modified input and the output/log files from this attempt for your reference.

Could you please advise which other diagnostics or experiments I should try? I'm happy to run any targeted tests you suggest and can provide additional logs. Thanks again for looking into this; any pointers on the next steps would be greatly appreciated.

Best,
Aolei

Post by **Daniele Varsano** » Wed Dec 17, 2025 10:24 am

Dear Aolei,
actually, it is not easy to spot the problem.
We probably need to reproduce your error to investigate it in depth.

Before doing that, I suggest you set up a more manageable calculation by doing the following.

1) Consider calculating just one or two QPs by setting, e.g.:

Code: Select all

%QPkrange                        # [GW] QP generalized Kpoint/Band indices
  1|1|83|84|
 %

This will speed up the calculation a lot while still reproducing the error, as you get NaN for all the self-energy calculations.

2) Exploit symmetries by slightly modifying your atomic positions, placing them at symmetric points, e.g.:

Code: Select all

In1       :   3.8602    2.22870   24.33920
In2       :   7.72048   4.4574    32.29734
In3       :   3.8602    2.22870   42.89197
In4       :   0         0         50.8602
Se1       :   0         0         21.9889
Se2       :   3.8602    2.22870   29.13587
Se3       :   0         0         34.8073
Se4       :   7.72048   4.4574    40.55669
Se5       :   3.8602    2.2287    47.68916
Se6       :   7.72048   4.4574    53.3784

This will require repeating your scf/nscf calculations, but the symmetries will then be detected by QE and Yambo, reducing the number of q-points in the IBZ.

Best,

Daniele
