MPI Error while running IP RPA

Deals with issues related to computation of optical spectra in reciprocal space: RPA, TDDFT, local field effects.



MPI Error while running IP RPA

Post by muhammadhasan » Tue Oct 01, 2024 2:48 pm

Hi,

I want to calculate the dielectric function using the IP-RPA approximation. The input file I have prepared (shared below) contains 625 q points in total. When I submit the job in parallel, it does not finish; every time I get the following errors:
Fatal error in PMPI_Comm_split:
Other MPI error, error stack:
PMPI_Comm_split(1294)...............: MPI_Comm_split(comm=0xc40032a6, color=2, key=3, new_comm=0xe09157c) failed
PMPI_Comm_split(1276)...............:
MPIR_Comm_split_allgather(1005).....:
MPIR_Get_contextid_sparse_group(613): Too many communicators (24/2048 free on this process; ignore_id=0)

[cli_3]: aborting job:
Fatal error in PMPI_Comm_split:
Other MPI error, error stack:
PMPI_Comm_split(1294)...............: MPI_Comm_split(comm=0xc40032a6, color=2, key=3, new_comm=0xe09157c) failed
PMPI_Comm_split(1276)...............:
MPIR_Comm_split_allgather(1005).....:
MPIR_Get_contextid_sparse_group(613): Too many communicators (24/2048 free on this process; ignore_id=0)
When I check the output, I see that only 142 files (q points) are generated instead of the total 625 q points. Here is my input:

Code: Select all

optics                           # [R] Linear Response optical properties
chi                              # [R][CHI] Dyson equation for Chi.
dipoles                          # [R] Oscillator strenghts (or dipoles)
Chimod= "IP"                     # [X] IP/Hartree/ALDA/LRC/PF/BSfxc
% QpntsRXd
    1 |  625 |                       # [Xd] Transferred momenta
%
% BndsRnXd
  1 | 24 |                           # [Xd] Polarization function bands
%
% EnRngeXd
  0.00000 | 1.00000 |         eV    # [Xd] Energy range
%
% DmRngeXd
 0.003000 | 0.003000 |         eV    # [Xd] Damping range
%
ETStpsXd= 200                    # [Xd] Total Energy steps
% LongDrXd
 1.000000 | 0.000000 | 0.000000 |        # [Xd] [cc] Electric Field


This is the job file I submitted in parallel mode:
#!/usr/bin/env bash
#SBATCH --job-name=gr_300K
#SBATCH --nodes=3 # node count
#SBATCH --ntasks-per-node=24 # number of tasks per node
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=5gb # Job memory request
#SBATCH --time=60:00:00 # Time limit hrs:min:sec
#SBATCH --output=sdc.txt # Standard output and error log
#SBATCH --partition=skylake # MOAB/Torque called these queues

module load yambo
srun yambo -F input.in -J Full_mpi_final
Is there any recommendation I can follow so that the job finishes smoothly? Thank you in advance.

Best
M J Hasan
PhD Student
Mechanical Engineering
University of Maine


Re: MPI Error while running IP RPA

Post by Daniele Varsano » Wed Oct 02, 2024 8:02 am

Dear M J Hasan,

Using this input, you will obtain the IP dielectric function for all the q points; is this what you need?

To spot the problem, can you please share your report and one of the log files?
The input file you copied seems truncated, as there is a missing % symbol at the end, but it is probably just a copy/paste issue.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/


Re: MPI Error while running IP RPA

Post by muhammadhasan » Thu Oct 03, 2024 4:34 pm

Hi Professor,

Please find the necessary files attached.

Thanks
Best
Hasan


Re: MPI Error while running IP RPA

Post by muhammadhasan » Thu Oct 03, 2024 6:24 pm

Hi,

using this input, you will obtain the IP dielectric function for all the q points, is this what you need?

--Yes, you are right!


Re: MPI Error while running IP RPA

Post by Daniele Varsano » Fri Oct 04, 2024 4:42 pm

Dear Hasan,
the error you get, "Too many communicators", could be related to the large number of MPI processes and the parallel strategy adopted for this run.

You can try to change the parallelization strategy by assigning it directly in the input file, adding these variables:

Code: Select all

X_and_IO_CPU= "1 1 1 12 2"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
and run with 24 CPUs (1 node). Given the number of occupied and empty bands considered, I suggest you not use a larger number of CPUs. Alternatively, you can raise the number assigned to "q", e.g.

Code: Select all

X_and_IO_CPU= "3 1 1 12 2"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
and run with 72 CPUs (3 nodes), as you are doing now.
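
For reference, a minimal sketch of the matching SLURM request for the 24-task case, assuming the same script structure as in the first post (the product of the X_and_IO_CPU entries should match the number of MPI tasks, 1*1*1*12*2 = 24):

Code: Select all

#!/usr/bin/env bash
#SBATCH --job-name=gr_300K
#SBATCH --nodes=1                # single node
#SBATCH --ntasks-per-node=24     # 24 MPI tasks = product of the X_and_IO_CPU entries
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=60:00:00
#SBATCH --partition=skylake

module load yambo
srun yambo -F input.in -J Full_mpi_final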

If it still fails, please note that this calculation can be split into several runs, each containing fewer q points (QpntsRXd).
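
For instance, a minimal sketch of such a split in chunks of 125 q points, assuming a template input where only the QpntsRXd range changes (the placeholder %QPNTS% and the file names are only illustrative):

Code: Select all

#!/usr/bin/env bash
# Sketch: run the 625 q points in chunks of 125, each chunk with its own input and job label.
for start in 1 126 251 376 501; do
  end=$(( start + 124 )); (( end > 625 )) && end=625
  # replace the hypothetical %QPNTS% placeholder inside the QpntsRXd block
  sed "s/%QPNTS%/${start} | ${end} |/" input_template.in > input_q${start}-${end}.in
  srun yambo -F input_q${start}-${end}.in -J Full_q${start}-${end}
done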

Please sign your posts with your name and affiliation; you can do this once and for all by filling in the signature field in your user profile.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/


Re: MPI Error while running IP RPA

Post by muhammadhasan » Mon Oct 21, 2024 4:35 am

Hi Professor,

Thank you so much for all of your help.

At one point, when I increased the number of QpntsRXd and submitted the job in parallel, I got the following error:
slurmstepd-node-150: error: Detected 1 oom-kill event(s) in StepId=18211.2. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: node-150: task 70: Out Of Memory
slurmstepd-node-149: error: Detected 1 oom-kill event(s) in StepId=18211.2. Some of your processes may have been killed by the cgroup out-of-memory handler.
slurmstepd-node-147: error: Detected 1 oom-kill event(s) in StepId=18211.2. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
To run the job, I followed one of your suggested strategies, namely splitting the calculation into several runs containing fewer q points (QpntsRXd) each. Here is my job file:
#!/usr/bin/env bash
#SBATCH --job-name=gr_300K
#SBATCH --nodes=3 # node count
#SBATCH --ntasks-per-node=24 # number of tasks per node
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=10gb # Job memory request
#SBATCH --time=120:00:00 # Time limit hrs:min:sec
#SBATCH --output=sdc.txt # Standard output and error log
#SBATCH --partition=skylake # MOAB/Torque called these queues

The input file I prepared is as follows:

Code: Select all

optics                           # [R] Linear Response optical properties
chi                              # [R][CHI] Dyson equation for Chi.
dipoles                          # [R] Oscillator strenghts (or dipoles)
Chimod= "IP"                     # [X] IP/Hartree/ALDA/LRC/PF/BSfxc
% QpntsRXd
     1 |  8100 |                     # [Xd] Transferred momenta
%
% BndsRnXd
  1 | 16 |                           # [Xd] Polarization function bands
%
% EnRngeXd
  0.00000 | 1 |         eV    # [Xd] Energy range
%
% DmRngeXd
 0.003000 | 0.003000 |         eV    # [Xd] Damping range
%
ETStpsXd= 1000                    # [Xd] Total Energy steps
% LongDrXd
 1.000000 | 0.000000 | 0.000000 |        # [Xd] [cc] Electric Field
%
Would you please tell me how I can run the job smoothly without getting the above error? I have attached the relevant files in case they help.
Thank you

Best
M J Hasan
PhD Student
Mechanical Engineering
University of Maine


Re: MPI Error while running IP RPA

Post by Daniele Varsano » Mon Oct 21, 2024 8:04 am

Dear Hasan,

it would be useful to have a look at one of the log files; in any case, it looks like a memory error.
The issue here is that you have an unusually large number of k points, which is uncommon for a calculation of this kind. Are you sure you need so many?

Anyway, to overcome the memory issue, since this is an IP calculation you can try running it in serial and see if it works.
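
For instance, a minimal sketch of a serial job script (the job name and -J label are only illustrative; on many Slurm setups --mem=0 requests all of the node's memory, but please check your local policy):

Code: Select all

#!/usr/bin/env bash
#SBATCH --job-name=gr_300K_serial
#SBATCH --nodes=1
#SBATCH --ntasks=1               # single MPI task: all the node memory is available to it
#SBATCH --cpus-per-task=1
#SBATCH --mem=0                  # request all memory on the node (check local policy)
#SBATCH --time=120:00:00
#SBATCH --partition=skylake

module load yambo
srun yambo -F input.in -J IP_serial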

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/


Re: MPI Error while running IP RPA

Post by muhammadhasan » Mon Oct 21, 2024 3:54 pm

Hi Professor,

Basically, I want to evaluate an equation that is an integral of a function of the wave vector q (the inverse dielectric function). The authors of the paper replaced the integral by a sum over the parallel wave vectors in the first Brillouin zone, introducing a weight for each q point (W_q). They mention that the BZ is sampled on a 90 90 1 grid and that the weight of each q point is 1/N, where N is the total number of q points in the BZ (i.e., 8100). That also means (I believe) that they turned the symmetry option off.
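
In formula form, the replacement I mean is roughly the following, with f(q) standing for the q-dependent integrand (the inverse dielectric function in my case):

Code: Select all

\int_{\mathrm{BZ}} \frac{d^2\mathbf{q}}{\Omega_{\mathrm{BZ}}}\, f(\mathbf{q})
  \;\approx\; \sum_{\mathbf{q}\in\mathrm{BZ}} W_{\mathbf{q}}\, f(\mathbf{q}),
  \qquad W_{\mathbf{q}} = \frac{1}{N}, \quad N = 90 \times 90 = 8100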

I started to follow their procedure with the symmetry option turned off in the nscf calculation (Quantum ESPRESSO). However, after getting the out-of-memory error, I changed my mind and used the q-point weights with the symmetry option turned on in the nscf calculation; this time I got 721 q points in the IBZ (out of 8100 q points in the BZ), with the weight of each q point printed by yambo as follows:

Code: Select all

  K [  1]:  0.000000  0.000000  0.000000 [rlu]
        : 0.000000  0.000000  0.000000 [iku]
        : 0.000000  0.000000  0.000000 [cc]
        : weight :  0.123457E-3
        : wf components:  19651
  K [  2]:  0.000000  0.011111  0.000000 [rlu]
        : 0.000000  0.011111  0.000000 [iku]
        : 0.000000  0.017341  0.000000 [cc]
        : weight :  0.740743E-3
        : wf components:  19651
  K [  3]:  0.000000  0.022222  0.000000 [rlu]
        : 0.000000  0.022222  0.000000 [iku]
        : 0.000000  0.034682  0.000000 [cc]
        : weight :  0.740743E-3
        : wf components:  19637
.
.
.
  K [720]:  0.322222  0.333333  0.000000 [rlu]
        : 0.322222  0.494444  0.000000 [iku]
        : 0.435513  0.771672  0.000000 [cc]
        : weight :  0.001481
        : wf components:  19849
  K [721]:  0.333333  0.333333  0.000000 [rlu]
        : 0.333333  0.500000  0.000000 [iku]
        : 0.450531  0.780343  0.000000 [cc]
        : weight :  0.246914E-3
        : wf components:  19863
I have used these weights; however, my result does not match the authors' result (please see the attached result). I am now thinking that I should run the simulation with symmetry turned off to check whether the results match. I have also attached the LOG files for your kind perusal, professor.

Best
M J Hasan
PhD Student
Mechanical Engineering
University of Maine


Re: MPI Error while running IP RPA

Post by Daniele Varsano » Tue Oct 22, 2024 8:01 am

Dear M J Hasan,

1. The weights reported are normalized to the total number of k points in the BZ, so you should multiply each value by 8100 (a quick check is shown after this list).

2. Alternatively, you can use ypp (ypp -k) to print the correct weights (at the moment I am not able to check the exact command option).

3. Unfortunately, the log file does not report any useful info. Have you tried running the calculation in serial? In that way all the memory of the node is available.
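
Regarding point 1, a quick check using the values quoted in your previous post (90x90 = 8100 points in the BZ):

Code: Select all

# multiply yambo's normalized weights by the number of BZ points
awk 'BEGIN { printf "K[1] (Gamma): %.4f\n", 0.123457e-3 * 8100 }'   # ~1 equivalent BZ point
awk 'BEGIN { printf "K[2]        : %.4f\n", 0.740743e-3 * 8100 }'   # ~6 equivalent BZ points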

Best,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/


Re: MPI Error while running IP RPA

Post by muhammadhasan » Wed Oct 23, 2024 8:15 pm

Hi Professor,

Thank you for mentioning the weight normalization. I am running the calculation in serial, and so far there is no problem. I will let you know. Thank you so much again.

Best
M J Hasan
PhD student
Mechanical Engineering
University of Maine
