
too many communicators parallelization error

Posted: Tue Mar 19, 2024 5:49 am
by milesj
Hi all,

I keep running into the same error in my calculations for larger k grids. Everything goes smoothly until the BSE kernel calculation finishes; then the computation crashes with a "too many communicators" error and no further explanation, before the Haydock calculation can start. (I believe this, or something similar, also happens when I try a SLEPc calculation, but I'm not sure whether the problems are related.)

Code: Select all

Fatal error in PMPI_Comm_split: Other MPI error, error stack:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
PMPI_Comm_split(1294)...............: MPI_Comm_split(MPI_COMM_WORLD, color=2015, key=1, new_comm=0x1516916bb858) failed
PMPI_Comm_split(1276)...............:
MPIR_Comm_split_allgather(1005).....:
MPIR_Get_contextid_sparse_group(615): Too many communicators (0/2048 free on this process; ignore_id=0)
I've attached the setup file for my yambo compilation (yambo-5.1.1), as well as the crashed slurm log file and the yambo LOG file. Sometimes when this issue arises I'm able to run just the Haydock step without any parallelization on one node, but that's not possible when the memory required for the computation exceeds the RAM of the node I'm using. Sometimes it helps to run with just MPI parallelization across a few nodes and no OMP parallelization, but that also sometimes fails.
I'm not really sure how to approach this issue, so any advice is appreciated.

Best,
Miles

Re: too many communicators parallelization error

Posted: Wed Mar 20, 2024 5:59 pm
by Davide Sangalli
Dear Miles,
since this happens after the calculation of the kernel, I suspect it has something to do with the solver.

Checking the code, I see this might be due to the MPI implementation in the Haydock solver. Indeed, there is this code, which is very likely causing the issue.

Code: Select all

     do i_g=1,BS_nT_grps                                                
       !                                                                
       if (.not.PAR_IND_T_Haydock%element_1D(i_g)) then                 
         local_key=-1                                                   
         PAR_COM_T_Haydock(i_g)%my_CHAIN=BS_nT_grps+1                   
       else                                                             
         !                                                              
         local_key = 1                                                  
         if (PAR_IND_T_groups%element_1D(i_g)) local_key = 0            
         !                                                              
         PAR_COM_T_Haydock(i_g)%n_CPU=PAR_COM_T_Haydock(i_g)%n_CPU+1    
         PAR_COM_T_Haydock(i_g)%my_CHAIN = i_g                          
         !                                                              
       endif                                                            
       !                                                                
       call CREATE_the_COMM(PAR_COM_WORLD%COMM,PAR_COM_T_Haydock(i_g),local_key)
       !                                                                
     enddo                                                              
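To see why this overflows: the loop calls CREATE_the_COMM (a wrapper around MPI_Comm_split) once per transition group, so the number of communicators scales with BS_nT_grps, while MPICH only exposes roughly 2048 context IDs per process, as the error message above reports. A rough back-of-envelope sketch (the group counts below are illustrative assumptions, not yambo's actual bookkeeping):

```python
# Rough sketch: MPICH reserves one context ID per communicator, and the
# error above reports 2048 of them available per process.
MPICH_CONTEXT_IDS = 2048

def communicators_needed(bs_nt_grps):
    # The Haydock loop performs one MPI_Comm_split per transition group.
    return bs_nt_grps

# Illustrative numbers only: a 12x12x12 k-grid gives 1728 k-points;
# distributing additionally over e-h pairs multiplies the group count.
n_k = 12 * 12 * 12
assert communicators_needed(n_k) <= MPICH_CONTEXT_IDS      # k-only: fits
assert communicators_needed(2 * n_k) > MPICH_CONTEXT_IDS   # k x eh: overflows
```

This is why keeping BS_nT_grps small (by not distributing over eh) avoids the crash.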
I'll try to get in touch with other developers to discuss how this could be solved.
For now the only suggestion I can give is to not distribute over the variable eh, but rather over k and t, to minimize the number of groups (BS_nT_grps). Set something like this in the input:

Code: Select all

BS_ROLEs= "k.eh.t"
BS_CPU="nk.1.nt"
with nk*nt = ncpu.
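For example (illustrative numbers only), on 24 MPI tasks one consistent choice would be nk=4 and nt=6, so that nk*nt = 24:

```text
BS_ROLEs= "k.eh.t"
BS_CPU= "4.1.6"
```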

Best,
D.

Re: too many communicators parallelization error

Posted: Mon Apr 15, 2024 3:27 am
by milesj
Hi Davide,

Sorry I haven't had a chance to try this until recently, but I've now gotten the same error both with and without OpenMP parallelisation:

Code: Select all

At line 82 of file /global/homes/m/milesj/my_modules/yambo_cpu/yambo-5.1.0/src/parallel/PARALLEL_get_user_structure.F
Fortran runtime error: Bad value during integer read
That's from the slurm log file, here's the yambo log file:

Code: Select all

<---> P1: [01] MPI/OPENMP structure, Files & I/O Directories
 <---> P1-nid005208: MPI Cores-Threads   : 24(CPU)-1(threads)
 <---> P1-nid005208: MPI Cores-Threads   : BS(environment)-k.eh.t(CPUs)-4.1.6(ROLEs)
 <---> P1-nid005208: [02] CORE Variables Setup
 <---> P1-nid005208: [02.01] Unit cells
 <---> P1-nid005208: [02.02] Symmetries
 <---> P1-nid005208: [02.03] Reciprocal space
 <---> P1-nid005208: [02.04] K-grid lattice
 <---> P1-nid005208: Using the new bz sampling setup
 <---> P1-nid005208: Grid dimensions      :  12  12  12
 <---> P1-nid005208: [02.05] Energies & Occupations
 <06s> P1-nid005208: [03] Transferred momenta grid and indexing
 <06s> P1-nid005208: [MEMORY] Alloc bare_qpg( 1.800992 [Gb]) TOTAL:  1.957100 [Gb] (traced)  2.093616 [Gb] (memstat)
 <13s> P1-nid005208: [04] Dipoles
 <13s> P1-nid005208: DIPOLES parallel ENVIRONMENT is incomplete. Switching to defaults
 <13s> P1-nid005208: [PARALLEL DIPOLES for K(ibz) on 2 CPU] Loaded/Total (Percentual):259/518(50%)
 <13s> P1-nid005208: [PARALLEL DIPOLES for CON bands on 2 CPU] Loaded/Total (Percentual):3/5(60%)
 <13s> P1-nid005208: [PARALLEL DIPOLES for VAL bands on 6 CPU] Loaded/Total (Percentual):2/11(18%)
 <13s> P1-nid005208: [DIP] Checking dipoles header
 <13s> P1-nid005208: [WARNING] [r,Vnl^pseudo] included in position and velocity dipoles.
 <13s> P1-nid005208: [WARNING] In case H contains other non local terms, these are neglected
This was on 24 nodes of the Perlmutter CPU supercomputer with no OpenMP parallelisation, but I've also tried it with OMP_NUM_THREADS=128, and also with BS_CPU set to "48.1.64", all giving the same error.

Thanks,
Miles