too many communicators parallelization error
Posted: Tue Mar 19, 2024 5:49 am
Hi all,
I keep running into the same error in my calculations with larger k-point grids. Everything runs smoothly until the BSE kernel calculation finishes. Then, before the Haydock calculation can start, the run crashes with a "too many communicators" error and no further explanation. I believe the same or a similar error also occurs when I try a SLEPc calculation, but I'm not sure whether the two problems are related.
I've attached the setup file for my yambo compilation (yambo-5.1.1), as well as the slurm log of the crashed run and the yambo LOG file. When this issue arises, I can sometimes run just the Haydock step on a single node without any parallelization, but that isn't possible when the memory required for the computation exceeds the RAM of the node I'm using. Running with MPI parallelization only (no OpenMP) across a few nodes sometimes helps, but that also fails at times.
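An MPI-only run of the kind described above (several MPI tasks, no OpenMP threads) can be requested with a slurm batch script along these lines. This is a minimal sketch: the node and task counts, the input file name `bse_haydock.in` and the job label `haydock` are hypothetical placeholders, not the settings from the attached files.

```shell
#!/bin/bash
#SBATCH --nodes=4               # placeholder: "a few nodes"
#SBATCH --ntasks-per-node=32    # placeholder: one MPI task per core
#SBATCH --cpus-per-task=1       # no extra cores reserved for OpenMP threads

# Pure MPI: force a single OpenMP thread per MPI task
export OMP_NUM_THREADS=1

# -F selects the yambo input file, -J sets the job label (both placeholders here)
srun yambo -F bse_haydock.in -J haydock
```

Keeping `--cpus-per-task=1` together with `OMP_NUM_THREADS=1` makes sure slurm and the OpenMP runtime agree that each task uses exactly one core.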
The relevant part of the slurm log from the crashed run:

```
Fatal error in PMPI_Comm_split: Other MPI error, error stack:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
PMPI_Comm_split(1294)...............: MPI_Comm_split(MPI_COMM_WORLD, color=2015, key=1, new_comm=0x1516916bb858) failed
PMPI_Comm_split(1276)...............:
MPIR_Comm_split_allgather(1005).....:
MPIR_Get_contextid_sparse_group(615): Too many communicators (0/2048 free on this process; ignore_id=0)
```
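As far as I can tell, the last line says this MPI process has used up all 2048 communicator context IDs it had available, and MPI_Comm_split needs one for every new communicator; an ID only goes back into the pool when that communicator is released with MPI_Comm_free. The toy Python model here is just my guess at the mechanism, assuming communicators are created repeatedly without being freed (I have not checked whether yambo actually does this). `ContextIdPool` and `run` are hypothetical names, not MPICH or yambo code.

```python
class ContextIdPool:
    """Toy model of a per-process pool of MPI communicator context IDs.

    Hypothetical illustration only: real MPICH tracks IDs in a bit mask and
    also reserves some of them for built-in communicators like MPI_COMM_WORLD.
    """

    def __init__(self, size: int = 2048) -> None:
        self.size = size
        self.free_ids = set(range(size))

    def comm_split(self) -> int:
        """Stand-in for MPI_Comm_split: takes one context ID from the pool."""
        if not self.free_ids:
            raise RuntimeError(
                f"Too many communicators (0/{self.size} free on this process)"
            )
        return self.free_ids.pop()

    def comm_free(self, context_id: int) -> None:
        """Stand-in for MPI_Comm_free: returns the context ID to the pool."""
        self.free_ids.add(context_id)


def run(n_splits: int, free_each_time: bool, pool_size: int = 2048) -> int:
    """Create n_splits communicators and return how many IDs are still free."""
    pool = ContextIdPool(pool_size)
    for _ in range(n_splits):
        cid = pool.comm_split()
        if free_each_time:
            pool.comm_free(cid)  # release the communicator after use
    return len(pool.free_ids)


if __name__ == "__main__":
    # Freeing each communicator keeps the pool full: prints 2048
    print(run(5000, free_each_time=True))
    try:
        # Leaking one ID per split fails on split number 2049
        run(5000, free_each_time=False)
    except RuntimeError as err:
        print(err)  # Too many communicators (0/2048 free on this process)
```

In a real run the exact point of failure also depends on the MPI library and on how many IDs its built-in communicators already occupy, so the toy numbers only show the trend.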
I'm not really sure how to approach this issue, so any advice is appreciated.
Best,
Miles