netCDF problem with rev 14 : LFS support absent?

marco.govoni
Posts: 35
Joined: Thu May 21, 2009 3:46 pm

Post by marco.govoni » Thu Nov 11, 2010 9:21 am

Dear all,

I think I have pinned down the netCDF problem in rev 14. As the number of k-points increases, the database files get larger and larger. I get this error during (actually at the end of) the setup run:

Code:

 <03h-20m-02s> SE indexes |###############     | [075%] 05m-33s(E) 07m-24s(X)
 <03h-20m-24s> SE indexes |################    | [080%] 05m-55s(E) 07m-24s(X)
 <03h-20m-47s> SE indexes |#################   | [085%] 06m-18s(E) 07m-25s(X)
 <03h-21m-11s> SE indexes |##################  | [090%] 06m-42s(E) 07m-26s(X)
 <03h-21m-34s> SE indexes |################### | [095%] 07m-05s(E) 07m-27s(X)


 <03h-21m-57s> SE indexes |####################| [100%] 07m-28s(E) 07m-28s(X)
[ERROR] STOP signal received while in :[03] Transferred momenta grid
[ERROR][NetCDF] NetCDF: One or more variable sizes violate format constraints
The problem occurs when YAMBO tries to write data to ndb.kindx. As long as that file stays below 2 GB the problem does not occur, but once it grows beyond 2 GB, i.e. for very fine k-grids, the NetCDF error appears.
I added many print statements to trace the origin of the error and found that YAMBO (rev 14) correctly completes all the loops required by the setup run without errors.
But when I inspected the file with the tools provided by netCDF I found:

Code:

-bash-3.2$ ncdump -k ndb.kindx 
classic
-bash-3.2$ od -An -c -N4 ndb.kindx 
           C   D   F 001
which indicates that the file is in classic format, i.e. that Large File Support is absent, according to http://www.unidata.ucar.edu/software/ne ... 20Support5 .
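The same check can be scripted. What follows is a minimal Python sketch, not part of YAMBO or of the netCDF tools, that classifies a file from its leading bytes, i.e. the same bytes that od -c prints above. The helper names netcdf_kind and netcdf_kind_of_file are made up for illustration.

```python
# Magic numbers that open a netCDF file (the bytes that `od -c` displays).
_MAGIC = {
    b"CDF\x01": "classic",
    b"CDF\x02": "64-bit offset",
    b"CDF\x05": "64-bit data",
}
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # netCDF-4 files are HDF5 containers


def netcdf_kind(header: bytes) -> str:
    """Classify a netCDF file from its leading bytes (hypothetical helper)."""
    if header.startswith(_HDF5_SIGNATURE):
        return "netCDF-4 (HDF5)"
    return _MAGIC.get(header[:4], "unknown")


def netcdf_kind_of_file(path: str) -> str:
    """Read the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as handle:
        return netcdf_kind(handle.read(8))


if __name__ == "__main__":
    import sys

    # Usage: python check_kind.py ndb.kindx [more files...]
    for name in sys.argv[1:]:
        print(name, "->", netcdf_kind_of_file(name))
```

Only the first bytes are read, so the check stays cheap even for multi-gigabyte databases.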
Note, however, that I compiled rev 14 with --enable-largedb, as you can see from this part of my config.log:

Code:

(...)
enable_debug='yes'
enable_dp='no'
enable_largedb='yes'
(...)
To dig further, I added these print statements to the offending routine in src/modules/mod_IO.F:

Code:

(...)
#if defined _NETCDF_IO
         !
         ! Setting NF90_64BIT_OFFSET causes netCDF to create a 64-bit
         ! offset format file, instead of a netCDF classic format file.
         ! The 64-bit offset format imposes far fewer restrictions on very large
         ! (i.e. over 2 GB) data files. See Large File Support.
         !
         ! http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Large-File-Support.html
         ! http://www.unidata.ucar.edu/software/netcdf/faq-lfs.html
         !
         CREATE_MODE=nf90_share
         if ( present(ENABLE_LARGE_FILE)) CREATE_MODE=ior(nf90_share,nf90_64bit_offset)
         if ( present(ENABLE_LARGE_FILE)) then
           print*,'present ENABLE_LARGE_FILE', ENABLE_LARGE_FILE, desc
         else
           print*, 'not present ENABLE_LARGE_FILE'
         endif
         !
         if ( (io_action(ID)==OP_APP_WR_CL.or.io_action(ID)==OP_APP) ) then
           !
           if( file_exists(trim(io_file(ID))) ) then
             call netcdf_call(nf90_open(trim(io_file(ID)),&
&                             ior(nf90_write,nf90_share),io_unit(ID)))
           else
             call netcdf_call(nf90_create(trim(io_file(ID)),CREATE_MODE,io_unit(ID)))
             call netcdf_call(nf90_enddef(io_unit(ID)))
             if (io_action(ID)==OP_APP_WR_CL) io_action(ID)=OP_WR_CL
             if (io_action(ID)==OP_APP) io_action(ID)=OP_WR
           endif
           !
         else
           !
           call netcdf_call(nf90_create(trim(io_file(ID)),CREATE_MODE,io_unit(ID)))
           call netcdf_call(nf90_enddef(io_unit(ID)))
           !
         endif
#endif
(...)
The result, of course, is that the optional flag is not present when io_connect is called. Indeed, in src/io/ioQINDX.F io_connect is called without the optional logical argument ENABLE_LARGE_FILE:

Code:

 ioQINDX=io_connect(desc='kindx',type=1,ID=io_db)
So, as far as I understand, there is no connection between the configure flag --enable-largedb and this part of the code. As a consequence I can only run simulations that produce files smaller than 2 GB; otherwise the run stops with the error above.
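For orientation, the size of an index array can be estimated before running. The Python sketch below is only a back-of-the-envelope calculation under stated assumptions: it uses the layout qindx_B(nXkbz,nXkbz,2) quoted in the comments of ioQINDX.F, assumes 4-byte integers, treats 2 GiB as a simplified stand-in for the classic-format constraints, and uses made-up k-point counts.

```python
from math import prod

CLASSIC_LIMIT_BYTES = 2**31  # ~2 GiB, simplified threshold for classic files
BYTES_PER_INTEGER = 4        # assumption: default 4-byte Fortran integers


def array_bytes(shape, bytes_per_element=BYTES_PER_INTEGER):
    """Storage needed for a dense array with the given shape."""
    return prod(shape) * bytes_per_element


def needs_large_file(shape):
    """True if a single array of this shape reaches the ~2 GiB threshold."""
    return array_bytes(shape) >= CLASSIC_LIMIT_BYTES


if __name__ == "__main__":
    # Hypothetical k-point counts, chosen only to illustrate the scaling.
    for nXkbz in (4096, 8192, 16384):
        shape = (nXkbz, nXkbz, 2)  # same layout as qindx_B(nXkbz,nXkbz,2)
        size_gib = array_bytes(shape) / 2**30
        print(f"nXkbz={nXkbz:6d}  {size_gib:7.3f} GiB  "
              f"large file needed: {needs_large_file(shape)}")
```

An array of this shape grows quadratically with nXkbz, which is why a modest refinement of the k-grid can push a database past the limit.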

I hope this is clear and helps you resolve the problem.

Cheers!

Marco
Marco Govoni
Physics Department, University of Modena and Reggio Emilia (Italy)

Post by marco.govoni » Thu Nov 11, 2010 1:53 pm

By manually setting ENABLE_LARGE_FILE=.true. in src/modules/mod_IO.F, I get:

Code:

-bash-3.2$ ncdump -k ndb.kindx 
64-bit offset
-bash-3.2$ od -An -c -N4 ndb.kindx
           C   D   F 002
which should indicate that Large File Support is active.
With this trick the NetCDF error disappears; however, 'Game&Summary' is not written in l_setup and r_setup is not properly closed.

Solutions?

Marco

andrea marini
Posts: 325
Joined: Mon Mar 16, 2009 4:27 pm

Post by andrea marini » Fri Nov 12, 2010 2:47 pm

[quote="marco.govoni"]
Solutions?
[/quote]

I am working on it and will upload a new rev shortly. As a short-cut, do not set ENABLE_LARGE_FILE by hand, because it is an optional variable: doing so corrupts memory and can even cause a seg fault.

Instead, pass ENABLE_LARGE_FILE=.TRUE. as an argument to io_connect for the databases that exceed the 2 GB limit. In your case you need to add ENABLE_LARGE_FILE=.TRUE. to the io_connect call at line 36 of ioQINDX.F. Tell me if it works; in the meantime I will search for a more "elegant" solution.
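The reasoning can be sketched outside Fortran. In Fortran an optional dummy argument that was not passed must not be read or assigned; the usual pattern is to test present() and fall back to a default held in a local variable. The Python sketch below mimics this, with None standing in for "argument not passed". The function create_mode is a hypothetical name, not a YAMBO routine, and the flag values are meant to mirror the netCDF headers but should be checked against the installed library.

```python
from typing import Optional

# Illustrative bit flags; verify them against your netCDF installation.
NF90_SHARE = 0x0800
NF90_64BIT_OFFSET = 0x0200


def create_mode(enable_large_file: Optional[bool] = None) -> int:
    """Build the file-creation mode, treating None as 'argument not passed'."""
    # Copy the optional argument into a local with a safe default instead of
    # touching it directly, as one would do after present() in Fortran.
    large = bool(enable_large_file) if enable_large_file is not None else False
    mode = NF90_SHARE
    if large:
        mode |= NF90_64BIT_OFFSET  # request the 64-bit offset format
    return mode
```

Unlike the quoted mod_IO.F snippet, which tests only whether the argument is present, this sketch also honours its value, so passing False keeps the classic format.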

Andrea
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)

Post by marco.govoni » Fri Nov 12, 2010 3:42 pm

Actually I did something different:
In src/modules/mod_IO.F I forced the line

Code:

CREATE_MODE=nf90_share
to be

Code:

CREATE_MODE=ior(nf90_share,nf90_64bit_offset)
With this modification I got the problems described in my second post. This way, however, io_connect always uses the 64-bit offset format, not only when called by ioQINDX.
I am also trying your suggestion, but I don't know whether I can get the result before sp6 is shut down (today at 17:00). I'll let you know.

Marco

Post by marco.govoni » Mon Nov 15, 2010 9:45 am

No, by simply calling

Code:

ioQINDX=io_connect(desc='kindx',type=1,ENABLE_LARGE_FILE=.true.,ID=io_db)
I get

Code:

(...)
 <03h-14m-44s> SE indexes |                    | [000%] --(E) --(X)
 <03h-15m-05s> SE indexes |#                   | [005%] 20s(E) 06m-53s(X)
 <03h-15m-26s> SE indexes |##                  | [010%] 41s(E) 06m-56s(X)
 <03h-15m-47s> SE indexes |###                 | [015%] 01m-03s(E) 07m-02s(X)
 <03h-16m-10s> SE indexes |####                | [020%] 01m-25s(E) 07m-09s(X)
 <03h-16m-31s> SE indexes |#####               | [025%] 01m-47s(E) 07m-09s(X)
 <03h-16m-53s> SE indexes |######              | [030%] 02m-09s(E) 07m-10s(X)
 <03h-17m-15s> SE indexes |#######             | [035%] 02m-31s(E) 07m-12s(X)
 <03h-17m-38s> SE indexes |########            | [040%] 02m-53s(E) 07m-14s(X)
 <03h-17m-59s> SE indexes |#########           | [045%] 03m-15s(E) 07m-14s(X)
 <03h-18m-21s> SE indexes |##########          | [050%] 03m-37s(E) 07m-15s(X)
 <03h-18m-44s> SE indexes |###########         | [055%] 04m-00s(E) 07m-16s(X)
 <03h-19m-06s> SE indexes |############        | [060%] 04m-22s(E) 07m-16s(X)
 <03h-19m-29s> SE indexes |#############       | [065%] 04m-44s(E) 07m-17s(X)
 <03h-19m-51s> SE indexes |##############      | [070%] 05m-06s(E) 07m-18s(X)
 <03h-20m-13s> SE indexes |###############     | [075%] 05m-29s(E) 07m-19s(X)
 <03h-20m-36s> SE indexes |################    | [080%] 05m-51s(E) 07m-19s(X)
 <03h-20m-58s> SE indexes |#################   | [085%] 06m-14s(E) 07m-20s(X)
 <03h-21m-21s> SE indexes |##################  | [090%] 06m-37s(E) 07m-21s(X)
 <03h-21m-44s> SE indexes |################### | [095%] 07m-00s(E) 07m-22s(X)
No netCDF error occurs; however, the setup run stops at 95% without producing r_setup and without reporting any other kind of error.
A question: do you think the 64-bit offset format should be used for the whole database, or only for the files that exceed 2 GB (in my case ndb.kindx)? In my database everything is in classic format except ndb.kindx. Might there be a problem with a format mismatch within the database? This is just a guess; I have found no report about it.
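One way to see which databases ended up in which format is to scan the directory. This is a minimal Python sketch under two assumptions: the databases are files named ndb.* in a single directory (SAVE is used only as an example path), and the format can be read from the first four bytes as in the od output earlier in this thread. The name scan_databases is illustrative.

```python
from pathlib import Path

# First four bytes of the two formats discussed in this thread.
FORMATS = {b"CDF\x01": "classic", b"CDF\x02": "64-bit offset"}


def scan_databases(directory, pattern="ndb.*"):
    """Map each matching file name to the format implied by its first 4 bytes."""
    report = {}
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            report[path.name] = FORMATS.get(handle.read(4), "other")
    return report


if __name__ == "__main__":
    import sys

    # Usage: python scan_databases.py [directory]; defaults to ./SAVE
    target = sys.argv[1] if len(sys.argv) > 1 else "SAVE"
    if Path(target).is_dir():
        for name, kind in scan_databases(target).items():
            print(f"{name:30s} {kind}")
```

Running it on the database directory lists every file with its format, which makes a mixed classic / 64-bit offset set easy to spot.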

Marco

Post by andrea marini » Mon Nov 15, 2010 9:51 am

Marco, I have tried to fix the LFS support in the latest revision (rev. 15). To activate it, configure with --enable-netcdf-LFS=yes. With this flag most databases are now written using LFS. Not all of them, but this is easy to change in mod_IO.F, where the list is defined.

Can you try it and report any problem?

Andrea

Post by marco.govoni » Mon Nov 15, 2010 10:02 am

I will try.
Thanks
Marco

Post by marco.govoni » Mon Nov 15, 2010 3:13 pm

No, I still have that problem with rev 16.

Code:

(...)
 <03h-12m-54s> SE indexes |##########          | [050%] 03m-36s(E) 07m-12s(X)
 <03h-13m-16s> SE indexes |###########         | [055%] 03m-59s(E) 07m-14s(X)
 <03h-13m-38s> SE indexes |############        | [060%] 04m-20s(E) 07m-13s(X)
 <03h-14m-01s> SE indexes |#############       | [065%] 04m-43s(E) 07m-16s(X)
 <03h-14m-23s> SE indexes |##############      | [070%] 05m-05s(E) 07m-16s(X)
 <03h-14m-41s> SE indexes |###############     | [075%] 05m-24s(E) 07m-12s(X)
 <03h-15m-05s> SE indexes |################    | [080%] 05m-47s(E) 07m-14s(X)
 <03h-15m-26s> SE indexes |#################   | [085%] 06m-08s(E) 07m-14s(X)
 <03h-15m-47s> SE indexes |##################  | [090%] 06m-29s(E) 07m-12s(X)
 <03h-16m-08s> SE indexes |################### | [095%] 06m-51s(E) 07m-12s(X)
r_setup is not created.
To show you where it quits, I added many print statements to src/io/ioQINDX.F in rev 14:

Code:

integer function ioQINDX(Xk,q,io_db)
 !
 use R_lattice,      ONLY:nqibz,nqbz,qindx_X,qindx_B,qindx_S,&
&                         bse_scattering,qp_states_k,nXkibz,qindx_alloc,&
&                         Xk_grid_is_uniform,bz_samp,nXkbz
 use IO_m,           ONLY:io_connect,io_disconnect,io_sec,&
&                         io_elemental,io_status,io_bulk,read_is_on,io_header,&
&                         ver_is_gt_or_eq
 implicit none
 type(bz_samp)::q,Xk
 integer      ::io_db
 !
 ! Work Space
 !
 print*, 'calling connect'
 ioQINDX=io_connect(desc='kindx',type=1,ENABLE_LARGE_FILE=.true.,ID=io_db)
 print*, 'called connect'
 if (ioQINDX/=0) goto 1
 !
 print*, 'calling 1'
 if (any((/io_sec(io_db,:)==1/))) then
   !
   ioQINDX=io_header(io_db,IMPOSE_SN=.true.)
   print*, 'called io_header'
   !
   ! In V. 3.0.7 a real parameter (RL_v_comp_norm) has been removed
   !
   if (.not.ver_is_gt_or_eq(io_db,(/3,0,8/))) ioQINDX=-1
   if (ioQINDX/=0) goto 1
   !
   call io_elemental(io_db,VAR="PARS",VAR_SZ=8)
   print*, 'called io_elemental'
   !
   call io_elemental(io_db,I0=nXkbz)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" Polarization last K   :",I0=nXkibz)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" QP states             :",I1=qp_states_k,CHECK=.true.,OP=(/">=","<="/))
   print*, 'called io_elemental'
   call io_elemental(io_db,I0=q%nibz)
   print*, 'called io_elemental'
   call io_elemental(io_db,I0=q%nbz)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" X grid is uniform     :",L0=Xk_grid_is_uniform)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" BS scattering         :",L0=bse_scattering,CHECK=.true.,OP=(/"=="/))
   print*, 'called io_elemental'
   call io_elemental(io_db,VAR="",VAR_SZ=0)
   print*, 'called io_elemental'
   ioQINDX=io_status(io_db)
   nqbz=q%nbz
   nqibz=q%nibz
   !
   if (ioQINDX/=0.or..not.any((/io_sec(io_db,:)>1/))) goto 1
 endif
 print*, 'called 1'
 !
 print*, 'calling 2'
 if (any((/io_sec(io_db,:)==2/))) then
   if (read_is_on(io_db)) allocate(q%pt(q%nibz,3))
   call io_bulk(io_db,VAR="Qpts",VAR_SZ=shape(q%pt))
   print*, 'called io_bulk'
   call io_bulk(io_db,R2=q%pt)
   print*, 'called io_bulk'
 endif
 print*, 'called 2' 
 !
 ! qindx_X(nqibz,nXkbz,2)
 ! qindx_S(qp_states_k(2),nqbz,2)
 ! (bse_scattering) -> qindx_B(nXkbz,nXkbz,2)
 !
 print*, 'calling 3'
 if (any((/io_sec(io_db,:)==3/))) then
   if (read_is_on(io_db)) call qindx_alloc()
   call io_bulk(io_db,VAR="Qindx",VAR_SZ=shape(qindx_X))
   print*, 'called io_bulk'
   call io_bulk(io_db,I3=qindx_X)
   print*, 'called io_bulk'
   if (Xk_grid_is_uniform) then
     call io_bulk(io_db,VAR="Sindx",VAR_SZ=shape(qindx_S))
     print*, 'called io_bulk'
     call io_bulk(io_db,I3=qindx_S)
     print*, 'called io_bulk'
   endif
   if (bse_scattering) then
     call io_bulk(io_db,VAR="Bindx",VAR_SZ=shape(qindx_B))
     print*, 'called io_bulk'
     call io_bulk(io_db,I3=qindx_B)
     print*, 'called io_bulk'
   endif
 endif
 print*, 'called 3'
 !
 print*, 'calling disconnect'
1 call io_disconnect(ID=io_db)
 print*, 'called disconnect'
 !
end function
and this is the result

Code:

 calling connect
 called connect
 calling 1
 called io_header
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called 1
 calling 2
 called io_bulk
 called io_bulk
 called 2
 calling 3
 called io_bulk
 called io_bulk
After the second io_bulk call of block 3 the program stops, so the routine ioQINDX never returns. I suspect it stops at the same point in rev 16; if you need those prints I can repeat the test with rev 16.
Tell me if you need more info.

Marco

Post by andrea marini » Mon Nov 15, 2010 3:27 pm

Dear Marco, if the database is correctly created with LFS (check with the od program), then it is hard for me to help you further. Can you provide the input files/databases so that I can reproduce the error on my Linux box? If I can reproduce it there I can fix it; otherwise I need access to the machine where you are running.

Andrea

Post by marco.govoni » Mon Nov 15, 2010 3:44 pm

No problem, I can give you the inputs, or directly a tar of the SAVE directory right after p2y.
However, I have to warn you that each run takes almost 4 h (on just 1 CPU, because this part of the code is serial).

Marco
