Unexpected crash in the Haydock solver step

guilheme.vilhena
Posts: 3
Joined: Thu Jun 18, 2009 10:52 am

Unexpected crash in the Haydock solver step

Post by guilheme.vilhena » Wed Nov 18, 2009 3:22 pm

Dear developers,

I am doing a Bethe-Salpeter calculation on a nanowire. The Kohn-Sham orbitals were generated using ABINIT.
For some reason the calculation crashed in the "Haydock solver" step. Not everything is lost, I thought, since I can restart from the BSE kernel that is already on disk. The problem is that every time I resubmit the calculation it crashes with the following error in the log file:

<---> [07] BSE solver(s)
<01s> [07.01] Haydock solver
<01s> [Haydock] Use Database fragments
forrtl: severe (24): end-of-file during read, unit 0, file /dev/pts/45
Image PC Routine Line Source
yambo 000000000057443E Unknown Unknown Unknown
yambo 0000000000573498 Unknown Unknown Unknown
yambo 000000000053BCA2 Unknown Unknown Unknown
yambo 000000000050D083 Unknown Unknown Unknown
yambo 000000000050C960 Unknown Unknown Unknown
yambo 0000000000523DD6 Unknown Unknown Unknown
yambo 000000000049B828 Unknown Unknown Unknown
yambo 00000000004937C6 Unknown Unknown Unknown
yambo 000000000041BE37 Unknown Unknown Unknown
yambo 0000000000415B11 Unknown Unknown Unknown
yambo 000000000041495F Unknown Unknown Unknown
yambo 000000000040A2A9 Unknown Unknown Unknown
yambo 0000000000409ADA Unknown Unknown Unknown
yambo 0000000000406AD0 Unknown Unknown Unknown
libc.so.6 00002AFF942DD164 Unknown Unknown Unknown
yambo 0000000000405C69 Unknown Unknown Unknown

The svn version gives the same error. I've tested this on two different machines, JADE and Milipeia, and I cannot find the cause. Nevertheless I strongly believe that the problem is in the solver, because when I change it from "h" to "t" it runs (it doesn't do anything, but it doesn't crash).

The attached files are:
i) yambo-D: the output of yambo -D;
ii) log: the standard log;
iii) r_optics_bse_em1s_bss_04: the report file.


Can you help me please?
cheers!
Guilherme Vilhena


Laboratoire de Physique de la Matière
Condensée et Nanostructures
Université Claude Bernard Lyon 1 et CNRS
43 boulevard du 11 novembre 1918
69622 Villeurbanne Cedex

Daniele Varsano
Posts: 3816
Joined: Tue Mar 17, 2009 2:23 pm

Re: Unexpected crash in the Haydock solver step

Post by Daniele Varsano » Wed Nov 18, 2009 4:11 pm

Dear Guilherme,
it looks like an input/output error, as if something went wrong in
the BSE kernel database.
guilheme.vilhena wrote: because when I change it from "h" to "t" it runs (doesn't do anything, but it doesn't crash).
Please try the direct diagonalization "d", because "t" is not an option for the solver.
In this way you can see whether it is a problem with Haydock. In any case, please check that everything went OK
in the calculation of the BSE matrix.
If you had used netCDF support you could also look inside the BSE database, but from the log
file it looks like you did not use it.

Let us know,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

myrta gruning
Posts: 240
Joined: Tue Mar 17, 2009 11:38 am

Re: Unexpected crash in the Haydock solver step

Post by myrta gruning » Wed Nov 18, 2009 4:32 pm

Hello Guilherme,

I post here below the input that you could not attach and sent via email (for QpntsRXs you can usually leave the default value, that is, all q's).

I agree with Daniele. It seems that something went wrong with the calculation of the BSE matrix: the file is there, but it ends earlier than expected.

cheers
m

Code:

em1s                         # [R Xs] Static Inverse Dielectric Matrix
#% QpntsRXs
# 1 | 8 |                     # [Xs] Transferred momenta -----> Mandatory, a smaller value results in a program error
#%
% BndsRnXs
   1 | 130 |                 # [Xs] Polarization function bands
%
NGsBlkXs= 715           RL  # [Xs] Response block size
% LongDrXs
 0.000000 | 0.000000 | 1.000000 |      # [Xs] [cc] Electric Field
%

#---------------------------------------------------------------------------------------
#---------------------------------------------------------------------------------------

optics                       # [R OPT] Optics
bse                          # [R BSK] Bethe Salpeter Equation.
bss                          # [R BSS] Bethe Salpeter Equation solver
#---------------------------------------------------------------------------------------
% BEnRange
  0.00000 | 10.00000 |   eV  # [BSS] Energy range
%
% BDmRange
  0.15000 |  0.15000 |   eV  # [BSS] Damping range
%
BEnSteps= 580                      # [BSS] Energy steps
% BLongDir
 0.000000 | 0.000000 | 1.000000 |      # [BSS] [cc] Electric Field
%
#----------------------------------------------------------------------------------------
% KfnQP_E
 1.620000 | 1.000000 | 1.000000 |      # [EXTQP BSK BSS] E parameters (c/v)
%
BSresKmod= "xc"              # [BSK] Resonant Kernel mode. (`x`;`c`;`d`)
BSSmod= "h"                  # [BSS] Solvers `h/d/i/t`
#----------------------------------------------------------------------------------------
#------------ to converge ... ----------------------------------------------------------
#----------------------------------------------------------------------------------------
BSENGBlk= 715              RL # [BSK] Screened interaction block size
FFTGvecs= 5722             RL #  [FFT] Plane waves used to describe the wavefunctions
% BSEBands
    1 | 98  |                 # [BSK] Bands range
%
BSENGexx=  1430           RL # [BSK] Exchange components
WRbsWF                #        (BSS)         Write to disk excitonic the FWs
Dr Myrta Grüning
School of Mathematics and Physics
Queen's University Belfast - Northern Ireland

http://www.researcherid.com/rid/B-1515-2009

guilheme.vilhena
Posts: 3
Joined: Thu Jun 18, 2009 10:52 am

Re: Unexpected crash in the Haydock solver step

Post by guilheme.vilhena » Wed Nov 18, 2009 6:18 pm

Thanks a lot for the fast answers!

We have tried to run using the direct diagonalization "d", and it also crashed. This time the error was the following:
<01s> [07.01] Diagonalization solver
forrtl: severe (41): insufficient virtual memory
Image PC Routine Line Source
yambo 0000000000571642 Unknown Unknown Unknown
yambo 000000000057069C Unknown Unknown Unknown
yambo 000000000053816A Unknown Unknown Unknown
yambo 00000000005034EF Unknown Unknown Unknown
yambo 0000000000528A62 Unknown Unknown Unknown
yambo 000000000041C561 Unknown Unknown Unknown
yambo 000000000041864C Unknown Unknown Unknown
yambo 0000000000414F67 Unknown Unknown Unknown
yambo 000000000040A4AB Unknown Unknown Unknown
yambo 0000000000409C36 Unknown Unknown Unknown
yambo 0000000000406BD0 Unknown Unknown Unknown
libc.so.6 00002AC6DC50F164 Unknown Unknown Unknown
yambo 0000000000405D69 Unknown Unknown Unknown

Then we tried to run this calculation on 16 cores with 30 GB and 8 processors each, but using only one CPU in each core, and it also crashed. Using 64 cores we obtained the same error, so we assumed it wasn't a memory issue.

Then we noticed that when the calculation is run interactively it sits idle until you hit a key (any key), and it crashes as soon as you do.
Afterwards we decided to debug to see where the problem was coming from, and this (see the attached dbg.log.txt file) led us to the io_bulk.F file.
Then we added the line print *, "OLA1", io_unit(ID) at io_bulk.F:180, and some other print *, "OLA" statements, to see where the problem was coming from (check the attached io_bulk.F file). The standard output we got after recompiling and running the code can be found in the dbg.log2.txt file. So clearly it is trying to read the files from stderr; now we are trying to discover why.

How can I check the integrity of the files? (We couldn't compile with netCDF.)
It is an extremely heavy calculation, so regenerating the files is a very expensive test (it would take a month before we again have the CPU time available to redo the calculations).

I've also tested on a smaller wire. From the SAVE dir I erased the Haydock restart and submitted the calculation. Result: it ran; however, it was reading from io_unit(ID)=41 instead of from 0 (the situation we had in the bigger wire).

Do you have any ideas how to solve this problem?

Cheers!
G.Vilhena

Laboratoire de Physique de la Matière
Condensée et Nanostructures
Université Claude Bernard Lyon 1 et CNRS
43 boulevard du 11 novembre 1918
69622 Villeurbanne Cedex
Last edited by guilheme.vilhena on Thu Nov 19, 2009 8:41 am, edited 1 time in total.

andrea marini
Posts: 325
Joined: Mon Mar 16, 2009 4:27 pm

Re: Unexpected crash in the Haydock solver step

Post by andrea marini » Wed Nov 18, 2009 10:52 pm

guilheme.vilhena wrote: Thanks a lot for the fast answers!

We have tried to run using the direct diagonalization "d", and it also crashed. This time the error was the following:
<01s> [07.01] Diagonalization solver
forrtl: severe (41): insufficient virtual memory
Then we have tried to run this calculation on 16 cores with 30Gb and 8 processors each, but using only one cpu in each core. And it also crashed ... then using 64 cores and we have obtained the same error. So we assumed it wasn't a memory issue.

A virtual memory error can come from an erroneous filling of the memory, not only from an allocation problem. Like the others, I strongly suspect a partially written BS database. The BS kernel I/O and the diagonalization/Haydock solvers are a quite stable part of the code, and I do not expect to find drastic errors or bugs there.

guilheme.vilhena wrote: Then we noticed that if you run the calculation interactively it was idle until you hit a key (any), and then it crashed once you hit it.

On this point I do not know what is going on. I have never encountered such a problem.

guilheme.vilhena wrote: Afterwards we decided to debug, to see where the problem was coming from, and this (see dbg.log.txt attached file) led us to the io_bulk.F file.
Then we added a line in the file io_bulk.F:180 print *, "OLA1", io_unit(ID) and some other print *, "OLA" statements to see where the problem was coming from (check io_bulk.F attached file). The standard output we got after recompiling and running the code can be found in the dbg.log2.txt file. So clearly it is trying to read the files from stderr; now we are trying to discover why.

io_unit(ID) is zero by default. It is zero here not because Yambo is trying to read from stderr, but because it could not initialize the I/O by assigning a unit to the database. All these actions are done automatically in Yambo: an ID is created and a lot of information (the filename, whether it is NETCDF, the current section, ...) is stored in io_UNIT(ID), io_filename(ID), ....

guilheme.vilhena wrote: How can I check the integrity of the files?! (we couldn't compile with netcdf)

  • Look at the standard log of the BS kernel run (if you have it, please post it). If there were no errors, the DB should be OK. If the code stopped anywhere, the DB may be partially written.
  • I have seen in the report that Yambo fragmented the BS database. So in the SAVE folder (or somewhere else if you used the -J option) you should see a lot of directories like BS_Q1_000ik1_000ik2/ for ik1,2=1, 32 (the k-points in your system). If you see all these directories and each one contains a non-empty db.fragment, then the DB is complete. NOTE that if there are still folders to do, you can restart the calculation without starting from scratch: if there is a partially written db.fragment, simply delete the corresponding folder and launch yambo again. The only requirement is that there must be no holes. But as only the master CPU writes, in general there should not be any.
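
The folder check described in the second point can be scripted. What follows is only a rough sketch in Python, not a Yambo tool: it assumes the fragment folders are named BS_Q1_<ik1>_<ik2> and each holds a file called db.fragment, as described above. The function names and the list of expected pairs are hypothetical, and whether every (ik1, ik2) pair or only part of the matrix is written is left to the caller.

```python
import re
from pathlib import Path

# Assumed naming: BS_Q1_<ik1>_<ik2>, each folder holding a file db.fragment.
FRAGMENT_DIR = re.compile(r"^BS_Q1_(\d+)_(\d+)$")


def scan_fragments(save_dir):
    """Map (ik1, ik2) -> size of db.fragment in bytes (None if the file is absent)."""
    sizes = {}
    for entry in Path(save_dir).iterdir():
        match = FRAGMENT_DIR.match(entry.name)
        if match is None or not entry.is_dir():
            continue
        fragment = entry / "db.fragment"
        key = (int(match.group(1)), int(match.group(2)))
        sizes[key] = fragment.stat().st_size if fragment.is_file() else None
    return sizes


def report(sizes, expected_pairs):
    """Return (missing, empty).

    missing: expected pairs that have no folder at all (the "holes").
    empty:   folders whose db.fragment is absent or has zero length.
    """
    missing = sorted(p for p in expected_pairs if p not in sizes)
    empty = sorted(p for p, size in sizes.items() if not size)
    return missing, empty
```

For example, report(scan_fragments("SAVE"), expected) with expected built from the 32 k-points would list the holes and the empty fragments. A non-zero size does not prove that a fragment was written completely, so the log of the kernel run is still worth checking.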

guilheme.vilhena wrote: It is an extremely heavy calculation, so reproducing the files is a very expensive test (It would take a month to have again the opportunity (available cpu time) to redo the calculations).

It is always wise to proceed step by step and get familiar with the code before launching huge jobs ;) Moreover, the BS size (~70000x70000) is not impressively large. With Yambo we have reached Hamiltonians of 2 million x 2 million.

guilheme.vilhena wrote: I've also tested on a smaller wire. From the SAVE dir I've erased the Haydock restart, and submitted the calculation. Result: It ran, nevertheless it was reading the things from io_unit(ID)=41 instead from the 0 (the situation we had in the bigger wire)

Careful! Yambo never reads from unit 0! Unit 0 means that the DB was not found (see the routine io_reset in mod_IO.F). Units are assigned starting from 40 in the io_control routine (mod_IO.F).
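
To make the convention concrete, here is a toy illustration in Python. It is not Yambo's Fortran and does not reproduce io_control or io_reset; the names connect and check_before_read are hypothetical. It only mimics the two facts stated above: units are handed out from 40 upwards, and 0 marks a database that was not found.

```python
import os

FIRST_UNIT = 40  # units are handed out from here upwards (as stated above)
NOT_FOUND = 0    # sentinel value: no database was connected


def connect(path, units_in_use):
    """Return a free unit >= FIRST_UNIT if path exists, else NOT_FOUND."""
    if not os.path.isfile(path):
        return NOT_FOUND
    unit = FIRST_UNIT
    while unit in units_in_use:
        unit += 1
    units_in_use.add(unit)
    return unit


def check_before_read(unit, path):
    """Fail with a clear message instead of reading from an unconnected unit."""
    if unit == NOT_FOUND:
        raise FileNotFoundError(f"database not found: {path}")
```

Seen this way, a unit of 41 in the small test is an ordinary connected database, while a unit of 0 in the big run signals that a database file was missing.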
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)

guilheme.vilhena
Posts: 3
Joined: Thu Jun 18, 2009 10:52 am

Re: Unexpected crash in the Haydock solver step

Post by guilheme.vilhena » Thu Nov 19, 2009 4:19 pm

Hello!

We think we (you) have found the problem. When Yambo started generating the kernel, it didn't save (or calculate) the first fragments.
Normally it would start from SAVE/BS_Q1_00001_00001 and proceed until it reaches SAVE/BS_Q1_final_final (for example). Then, we think, it checks whether SAVE/BS_Q1_final_final was calculated and, if so, proceeds to the Haydock step, where it starts loading the first fragments; if it can't find them, it behaves in an odd way.

In our case we didn't have the first 250 database fragments; for a reason we aren't aware of, Yambo didn't save them. Nevertheless the calculation continued for the other fragments, and it stopped as soon as it started the Haydock solver.
Then, every time we restarted the calculation, since it was able to find the fragment SAVE/BS_Q1_final_final it proceeded immediately to the Haydock step, where it crashed for the simple reason that it couldn't find the first 250 DBs. [This is our interpretation of the problem.]

Now, if I may, one last question. If I generate the first 250 fragments of the BS kernel DB, can I add them to the SAVE dir where I have the others and restart the calculation (restarting from the Haydock step)?

Thanks to all developers!
Cheers!
Guilherme Vilhena
Laboratoire de Physique de la Matière
Condensée et Nanostructures
Université Claude Bernard Lyon 1 et CNRS
43 boulevard du 11 novembre 1918
69622 Villeurbanne Cedex

andrea marini
Posts: 325
Joined: Mon Mar 16, 2009 4:27 pm

Re: Unexpected crash in the Haydock solver step

Post by andrea marini » Thu Nov 19, 2009 4:32 pm

guilheme.vilhena wrote: Now, if I may, one last question. If I generate the first 250 fragments of the BSkernel db can I join them to the SAVE dir where I have the others and restart the calculation (restarting from the Haydock step) ?!

Yes, you can... thanks to the fact that, for some reason, I did not include in the fragments the serial number that Yambo uses to identify any database. Any BS database of Yambo comes with a random serial number that is used to identify the solver databases. If for some reason you rewrite the kernel, the solver databases will be rewritten from scratch in order to avoid inconsistent results.

Now, the serial number of the BS kernel is written only in the head, db.BS_Q1, and not in the fragments. So if you recalculate the head and only some of the fragments, the code should not complain. Of course you must use the same variables; otherwise the blocks of the Hamiltonian may have different sizes and the I/O could fail.
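
If the missing fragments are recomputed in a separate directory, joining them to the old SAVE dir could look roughly like the Python sketch below. It is only an illustration under assumptions: the folder pattern BS_Q1_<ik1>_<ik2> is taken from this thread, while merge_fragments and the two directory arguments are hypothetical. It copies only the fragment folders that are absent from the old directory and touches nothing else, including the head db.BS_Q1.

```python
import re
import shutil
from pathlib import Path

FRAGMENT_DIR = re.compile(r"^BS_Q1_\d+_\d+$")  # assumed folder naming


def merge_fragments(new_save, old_save, dry_run=True):
    """Copy fragment folders present in new_save but absent from old_save.

    Folders that already exist in old_save are never overwritten.
    Returns the folder names copied (or that would be copied when dry_run is True).
    """
    new_save, old_save = Path(new_save), Path(old_save)
    copied = []
    for src in sorted(new_save.iterdir()):
        if not (src.is_dir() and FRAGMENT_DIR.match(src.name)):
            continue
        dst = old_save / src.name
        if dst.exists():
            continue  # keep the fragment that is already there
        if not dry_run:
            shutil.copytree(src, dst)
        copied.append(src.name)
    return copied
```

The default dry_run=True only lists what would be copied, which gives a chance to compare the list with the fragments known to be missing before anything in the expensive SAVE dir is changed.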
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)
