
netCDF problem with rev 14 : LFS support absent?

Posted: Thu Nov 11, 2010 9:21 am
by marco.govoni
Dear all,

I think I have pinpointed the netCDF problem with rev 14. As the number of k-points increases, the database files get larger and larger. I found this error during (actually at the end of) the setup run:

Code: Select all

 <03h-20m-02s> SE indexes |###############     | [075%] 05m-33s(E) 07m-24s(X)
 <03h-20m-24s> SE indexes |################    | [080%] 05m-55s(E) 07m-24s(X)
 <03h-20m-47s> SE indexes |#################   | [085%] 06m-18s(E) 07m-25s(X)
 <03h-21m-11s> SE indexes |##################  | [090%] 06m-42s(E) 07m-26s(X)
 <03h-21m-34s> SE indexes |################### | [095%] 07m-05s(E) 07m-27s(X)


 <03h-21m-57s> SE indexes |####################| [100%] 07m-28s(E) 07m-28s(X)
[ERROR] STOP signal received while in :[03] Transferred momenta grid
[ERROR][NetCDF] NetCDF: One or more variable sizes violate format constraints
The problem occurs when YAMBO tries to store data in ndb.kindx. As long as this file stays smaller than 2 GB the problem does not occur, but when it grows beyond 2 GB, i.e. for very fine k-grids, the NetCDF error appears.
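To get a feeling for when a single array crosses the limit, here is a rough estimate. It is only a sketch: the helper name format_needed is hypothetical, the grid sizes are made-up placeholders (not the values of this run), 4-byte integers are assumed, and the per-variable limits (2 GiB - 4 bytes for classic, 4 GiB - 4 bytes for 64-bit offset) are what I read in the netCDF Large File Support documentation, which also allows the last fixed-size variable to be larger.

```shell
#!/usr/bin/env bash
# format_needed BYTES: compare one variable's size against the per-variable
# limits of the two netCDF-3 formats. Hypothetical helper name.
format_needed() {
  local bytes=$1
  local classic_limit=$(( (1 << 31) - 4 ))    # 2 GiB - 4 bytes
  local offset64_limit=$(( (1 << 32) - 4 ))   # 4 GiB - 4 bytes
  if   (( bytes > offset64_limit )); then echo "too large for 64-bit offset"
  elif (( bytes > classic_limit ));  then echo "needs 64-bit offset"
  else                                    echo "fits in classic"
  fi
}

# Array shaped like qindx_X(nqibz,nXkbz,2); grid sizes are made-up placeholders.
nqibz=10000
nXkbz=30000
bytes=$(( nqibz * nXkbz * 2 * 4 ))   # assumes 4-byte integers
echo "$bytes bytes: $(format_needed "$bytes")"
```

With these placeholder sizes the array takes 2400000000 bytes, which is already over the classic limit.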
I added a lot of prints to analyze the origin of the error and found that YAMBO (rev 14) correctly completes all the loops required by the setup run without errors.
But when I tried to use tools provided by netcdf I discovered that:

Code: Select all

-bash-3.2$ ncdump -k ndb.kindx 
classic
-bash-3.2$ od -An -c -N4 ndb.kindx 
           C   D   F 001
which indicates that Large File support is absent, according to http://www.unidata.ucar.edu/software/ne ... 20Support5 .
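The od check above can be wrapped in a small helper so that it is easy to repeat on any database file. This is a minimal sketch, assuming bash and the standard od and tr tools; the function name nc_format is hypothetical. It only looks at the first four bytes of the file.

```shell
#!/usr/bin/env bash
# nc_format FILE: report the netCDF on-disk format from the 4-byte magic number.
# Hypothetical helper name; only od and tr are used.
nc_format() {
  local magic
  # Dump the first 4 bytes as hex and strip whitespace, e.g. "43444601"
  magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
  case "$magic" in
    43444601) echo "classic" ;;              # 'C' 'D' 'F' 001
    43444602) echo "64-bit offset" ;;        # 'C' 'D' 'F' 002
    89484446) echo "HDF5-based (netCDF-4)" ;; # HDF5 signature \211 'H' 'D' 'F'
    *)        echo "unknown ($magic)" ;;
  esac
}
```

For example, `nc_format ndb.kindx` should agree with what `ncdump -k ndb.kindx` reports for the classic and 64-bit offset cases.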
Note, however, that I compiled rev 14 with --enable-largedb, as you can see from this part of my config.log:

Code: Select all

(...)
enable_debug='yes'
enable_dp='no'
enable_largedb='yes'
(...)
To investigate further, I added these prints to the offending routine in src/modules/mod_IO.F:

Code: Select all

(...)
#if defined _NETCDF_IO
         !
         ! Setting NF90_64BIT_OFFSET causes netCDF to create a 64-bit
         ! offset format file, instead of a netCDF classic format file.
         ! The 64-bit offset format imposes far fewer restrictions on very large
         ! (i.e. over 2 GB) data files. See Large File Support.
         !
         ! http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Large-File-Support.html
         ! http://www.unidata.ucar.edu/software/netcdf/faq-lfs.html
         !
         CREATE_MODE=nf90_share
         if ( present(ENABLE_LARGE_FILE)) CREATE_MODE=ior(nf90_share,nf90_64bit_offset)
         if ( present(ENABLE_LARGE_FILE)) then
           print*,'present ENABLE_LARGE_FILE', ENABLE_LARGE_FILE, desc
         else
           print*, 'not present ENABLE_LARGE_FILE'
         endif
         !
         if ( (io_action(ID)==OP_APP_WR_CL.or.io_action(ID)==OP_APP) ) then
           !
           if( file_exists(trim(io_file(ID))) ) then
             call netcdf_call(nf90_open(trim(io_file(ID)),&
&                             ior(nf90_write,nf90_share),io_unit(ID)))
           else
             call netcdf_call(nf90_create(trim(io_file(ID)),CREATE_MODE,io_unit(ID)))
             call netcdf_call(nf90_enddef(io_unit(ID)))
             if (io_action(ID)==OP_APP_WR_CL) io_action(ID)=OP_WR_CL
             if (io_action(ID)==OP_APP) io_action(ID)=OP_WR
           endif
           !
         else
           !
           call netcdf_call(nf90_create(trim(io_file(ID)),CREATE_MODE,io_unit(ID)))
           call netcdf_call(nf90_enddef(io_unit(ID)))
           !
         endif
#endif
(...)
The result is, of course, that the optional flag is not present when io_connect is called. Indeed, if you look at src/io/ioQINDX.F, the io_connect function is called without passing the optional logical ENABLE_LARGE_FILE:

Code: Select all

 ioQINDX=io_connect(desc='kindx',type=1,ID=io_db)
So, as far as I understand, there is no connection between the --enable-largedb flag and this part of the code. As a consequence, I can only run simulations that produce files smaller than 2 GB; otherwise they stop with this error.
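To see which other databases are in the same situation, one can list the io_connect calls that do not pass the flag. This is just a sketch: the function name calls_without_lfs is hypothetical, and a call that is continued over several lines with & would be reported even if the flag sits on a continuation line.

```shell
#!/usr/bin/env bash
# calls_without_lfs DIR: list io_connect calls under DIR that do not pass
# ENABLE_LARGE_FILE on the same line. Hypothetical helper name.
calls_without_lfs() {
  # "io_connect(" matches the calls but not the "use IO_m, ONLY:io_connect," lines
  grep -rn "io_connect(" "$1" | grep -v "ENABLE_LARGE_FILE"
}
```

For example, `calls_without_lfs src/io` run from the top of the source tree should include the ioQINDX.F line shown above.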

I hope this is clear and helps you resolve the problem.

Cheers!

Marco

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Thu Nov 11, 2010 1:53 pm
by marco.govoni
By manually setting ENABLE_LARGE_FILE=.true. in src/modules/mod_IO.F, I get:

Code: Select all

-bash-3.2$ ncdump -k ndb.kindx 
64-bit offset
-bash-3.2$ od -An -c -N4 ndb.kindx
           C   D   F 002
which should indicate correct Large File Support.
With this trick the NetCDF error disappears; however, 'Game&Summary' is not written in l_setup and r_setup is not properly closed.

Solutions?

Marco

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Fri Nov 12, 2010 2:47 pm
by andrea marini
[quote="marco.govoni"]
Solutions?
[/quote]

I am working on it and will upload a new rev shortly. As a shortcut, do not set ENABLE_LARGE_FILE by hand, because it is an optional argument. Doing so messes up the memory and you can even get a seg fault as a result.

Instead, pass ENABLE_LARGE_FILE=.TRUE. as an argument to io_connect for the databases that exceed the 2 GB limit. In your case you need to pass ENABLE_LARGE_FILE=.TRUE. at line 36 of ioQINDX.F. Tell me if it works; in the meantime I will search for a more "elegant" solution.

Andrea

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Fri Nov 12, 2010 3:42 pm
by marco.govoni
Actually I did something different:
In src/modules/mod_IO.F I forced the line

Code: Select all

CREATE_MODE=nf90_share
to be

Code: Select all

CREATE_MODE=ior(nf90_share,nf90_64bit_offset)
With this modification I got the problems described in my second post. In this way, however, io_connect always uses the 64-bit offset format, not only when called by ioQINDX.
I am also trying your suggestion, but I don't know whether I will get the result before sp6 is shut down (today at 17:00). I'll let you know.

Marco

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Mon Nov 15, 2010 9:45 am
by marco.govoni
No, by simply calling

Code: Select all

ioQINDX=io_connect(desc='kindx',type=1,ENABLE_LARGE_FILE=.true.,ID=io_db)
I get

Code: Select all

(...)
 <03h-14m-44s> SE indexes |                    | [000%] --(E) --(X)
 <03h-15m-05s> SE indexes |#                   | [005%] 20s(E) 06m-53s(X)
 <03h-15m-26s> SE indexes |##                  | [010%] 41s(E) 06m-56s(X)
 <03h-15m-47s> SE indexes |###                 | [015%] 01m-03s(E) 07m-02s(X)
 <03h-16m-10s> SE indexes |####                | [020%] 01m-25s(E) 07m-09s(X)
 <03h-16m-31s> SE indexes |#####               | [025%] 01m-47s(E) 07m-09s(X)
 <03h-16m-53s> SE indexes |######              | [030%] 02m-09s(E) 07m-10s(X)
 <03h-17m-15s> SE indexes |#######             | [035%] 02m-31s(E) 07m-12s(X)
 <03h-17m-38s> SE indexes |########            | [040%] 02m-53s(E) 07m-14s(X)
 <03h-17m-59s> SE indexes |#########           | [045%] 03m-15s(E) 07m-14s(X)
 <03h-18m-21s> SE indexes |##########          | [050%] 03m-37s(E) 07m-15s(X)
 <03h-18m-44s> SE indexes |###########         | [055%] 04m-00s(E) 07m-16s(X)
 <03h-19m-06s> SE indexes |############        | [060%] 04m-22s(E) 07m-16s(X)
 <03h-19m-29s> SE indexes |#############       | [065%] 04m-44s(E) 07m-17s(X)
 <03h-19m-51s> SE indexes |##############      | [070%] 05m-06s(E) 07m-18s(X)
 <03h-20m-13s> SE indexes |###############     | [075%] 05m-29s(E) 07m-19s(X)
 <03h-20m-36s> SE indexes |################    | [080%] 05m-51s(E) 07m-19s(X)
 <03h-20m-58s> SE indexes |#################   | [085%] 06m-14s(E) 07m-20s(X)
 <03h-21m-21s> SE indexes |##################  | [090%] 06m-37s(E) 07m-21s(X)
 <03h-21m-44s> SE indexes |################### | [095%] 07m-00s(E) 07m-22s(X)
No NetCDF error occurs; however, the setup run stops right at 95% without producing r_setup or any error message.
A question: do you think the 64-bit offset format should be used for the whole database, or just for the files that exceed 2 GB (in my case ndb.kindx)? In my database everything is in classic format except ndb.kindx. Might there be a problem with a format mismatch within the database? This is just a guess; I have found no report about it.

Marco

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Mon Nov 15, 2010 9:51 am
by andrea marini
Marco, I have tried to fix the LFS support in the latest revision (rev. 15). To activate it use --enable-netcdf-LFS=yes. Most databases are now written using LFS when this flag is activated. Not all of them, but this is easy to change in mod_IO.F, where the list is defined.

Can you try it and report any problems?

Andrea

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Mon Nov 15, 2010 10:02 am
by marco.govoni
I will try.
Thanks
Marco

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Mon Nov 15, 2010 3:13 pm
by marco.govoni
No, I still have that problem with rev 16.

Code: Select all

(...)
 <03h-12m-54s> SE indexes |##########          | [050%] 03m-36s(E) 07m-12s(X)
 <03h-13m-16s> SE indexes |###########         | [055%] 03m-59s(E) 07m-14s(X)
 <03h-13m-38s> SE indexes |############        | [060%] 04m-20s(E) 07m-13s(X)
 <03h-14m-01s> SE indexes |#############       | [065%] 04m-43s(E) 07m-16s(X)
 <03h-14m-23s> SE indexes |##############      | [070%] 05m-05s(E) 07m-16s(X)
 <03h-14m-41s> SE indexes |###############     | [075%] 05m-24s(E) 07m-12s(X)
 <03h-15m-05s> SE indexes |################    | [080%] 05m-47s(E) 07m-14s(X)
 <03h-15m-26s> SE indexes |#################   | [085%] 06m-08s(E) 07m-14s(X)
 <03h-15m-47s> SE indexes |##################  | [090%] 06m-29s(E) 07m-12s(X)
 <03h-16m-08s> SE indexes |################### | [095%] 06m-51s(E) 07m-12s(X)
r_setup is not created.
To show you where it quits, I put a lot of prints in src/io/ioQINDX.F of rev 14:

Code: Select all

integer function ioQINDX(Xk,q,io_db)
 !
 use R_lattice,      ONLY:nqibz,nqbz,qindx_X,qindx_B,qindx_S,&
&                         bse_scattering,qp_states_k,nXkibz,qindx_alloc,&
&                         Xk_grid_is_uniform,bz_samp,nXkbz
 use IO_m,           ONLY:io_connect,io_disconnect,io_sec,&
&                         io_elemental,io_status,io_bulk,read_is_on,io_header,&
&                         ver_is_gt_or_eq
 implicit none
 type(bz_samp)::q,Xk
 integer      ::io_db
 !
 ! Work Space
 !
 print*, 'calling connect'
 ioQINDX=io_connect(desc='kindx',type=1,ENABLE_LARGE_FILE=.true.,ID=io_db)
 print*, 'called connect'
 if (ioQINDX/=0) goto 1
 !
 print*, 'calling 1'
 if (any((/io_sec(io_db,:)==1/))) then
   !
   ioQINDX=io_header(io_db,IMPOSE_SN=.true.)
   print*, 'called io_header'
   !
   ! In V. 3.0.7 a real parameter (RL_v_comp_norm) has been removed
   !
   if (.not.ver_is_gt_or_eq(io_db,(/3,0,8/))) ioQINDX=-1
   if (ioQINDX/=0) goto 1
   !
   call io_elemental(io_db,VAR="PARS",VAR_SZ=8)
   print*, 'called io_elemental'
   !
   call io_elemental(io_db,I0=nXkbz)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" Polarization last K   :",I0=nXkibz)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" QP states             :",I1=qp_states_k,CHECK=.true.,OP=(/">=","<="/))
   print*, 'called io_elemental'
   call io_elemental(io_db,I0=q%nibz)
   print*, 'called io_elemental'
   call io_elemental(io_db,I0=q%nbz)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" X grid is uniform     :",L0=Xk_grid_is_uniform)
   print*, 'called io_elemental'
   call io_elemental(io_db,&
&       VAR=" BS scattering         :",L0=bse_scattering,CHECK=.true.,OP=(/"=="/))
   print*, 'called io_elemental'
   call io_elemental(io_db,VAR="",VAR_SZ=0)
   print*, 'called io_elemental'
   ioQINDX=io_status(io_db)
   nqbz=q%nbz
   nqibz=q%nibz
   !
   if (ioQINDX/=0.or..not.any((/io_sec(io_db,:)>1/))) goto 1
 endif
 print*, 'called 1'
 !
 print*, 'calling 2'
 if (any((/io_sec(io_db,:)==2/))) then
   if (read_is_on(io_db)) allocate(q%pt(q%nibz,3))
   call io_bulk(io_db,VAR="Qpts",VAR_SZ=shape(q%pt))
   print*, 'called io_bulk'
   call io_bulk(io_db,R2=q%pt)
   print*, 'called io_bulk'
 endif
 print*, 'called 2' 
 !
 ! qindx_X(nqibz,nXkbz,2)
 ! qindx_S(qp_states_k(2),nqbz,2)
 ! (bse_scattering) -> qindx_B(nXkbz,nXkbz,2)
 !
 print*, 'calling 3'
 if (any((/io_sec(io_db,:)==3/))) then
   if (read_is_on(io_db)) call qindx_alloc()
   call io_bulk(io_db,VAR="Qindx",VAR_SZ=shape(qindx_X))
   print*, 'called io_bulk'
   call io_bulk(io_db,I3=qindx_X)
   print*, 'called io_bulk'
   if (Xk_grid_is_uniform) then
     call io_bulk(io_db,VAR="Sindx",VAR_SZ=shape(qindx_S))
     print*, 'called io_bulk'
     call io_bulk(io_db,I3=qindx_S)
     print*, 'called io_bulk'
   endif
   if (bse_scattering) then
     call io_bulk(io_db,VAR="Bindx",VAR_SZ=shape(qindx_B))
     print*, 'called io_bulk'
     call io_bulk(io_db,I3=qindx_B)
     print*, 'called io_bulk'
   endif
 endif
 print*, 'called 3'
 !
 print*, 'calling disconnect'
1 call io_disconnect(ID=io_db)
 print*, 'called disconnect'
 !
end function
and this is the result

Code: Select all

 calling connect
 called connect
 calling 1
 called io_header
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called io_elemental
 called 1
 calling 2
 called io_bulk
 called io_bulk
 called 2
 calling 3
 called io_bulk
 called io_bulk
After the second io_bulk call of block 3 the program stops; the routine ioQINDX never returns. I suspect rev 16 stops at the same point. If you need these prints for rev 16 as well, I can add them.
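Since the run ends without any message, it may help to look at how the process actually terminated. The following is only a sketch with a hypothetical helper name (run_and_report); it relies on the usual shell convention that a status above 128 means the process was killed by signal (status - 128).

```shell
#!/usr/bin/env bash
# run_and_report CMD [ARGS...]: run a command and say how it ended.
# Hypothetical helper name.
run_and_report() {
  "$@"
  local rc=$?
  if (( rc > 128 )); then
    # kill -l translates a signal number into its name, e.g. 11 -> SEGV
    echo "killed by signal $(kill -l $(( rc - 128 )))"
  else
    echo "exit status $rc"
  fi
}
```

It would be used as `run_and_report ./yambo`, with whatever arguments the setup run normally takes.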
Tell me if you need more info.

Marco

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Mon Nov 15, 2010 3:27 pm
by andrea marini
Dear Marco, if the database is correctly created with LFS (check with the od program), then it is hard for me to help you further. Can you provide me with the input files/databases to reproduce the error on my Linux box? If I can reproduce it there I can fix it; otherwise I need access to the machine where you are running.

Andrea

Re: netCDF problem with rev 14 : LFS support absent?

Posted: Mon Nov 15, 2010 3:44 pm
by marco.govoni
No problem, I can give you the inputs, or directly a tar of the SAVE directory right after the p2y run.
However, I have to warn you that each run takes almost 4 h (on just 1 CPU, because this part of the code is serial).

Marco