Parallelization error when running on multiple nodes

stefan19rkc
Posts: 5
Joined: Fri Mar 24, 2023 1:11 pm

Parallelization error when running on multiple nodes

Post by stefan19rkc » Wed Jul 12, 2023 9:32 am

Dear Yambo community,

I have been working on some calculations for a very large system and have managed to run them on a single cluster node. When I try to run the same calculation on 3 nodes of the same type, I get an out-of-memory (OOM) error. Is there a way I could quickly resolve this?

Please find the log, input file, and job launch script attached, and let me know if you have additional questions. Note: I tried running with both MPI (which worked in the first place, but only on one node) and OpenMP, and it made no difference.

Kind regards,
Stefan Velja

Daniele Varsano
Posts: 3980
Joined: Tue Mar 17, 2009 2:23 pm

Re: Parallelization error when running on multiple nodes

Post by Daniele Varsano » Wed Jul 12, 2023 10:34 am

Dear Stefan,

Please note that you are asking for 144 tasks, but then assigning 288 tasks in the input variables.
As a result, the parallel distribution actually used, as reported in the log file, is not the one indicated in the input.
I do not know how many cores you have per node. Of course, you can use fewer of them in order to have more memory available per task.

In order to optimize memory distribution among tasks, you can try to set:

Code: Select all

X_and_IO_CPU= "1 1 1 32 9"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
if you plan to use 288 tasks,

or something like:

Code: Select all

X_and_IO_CPU= "1 1 1 47 4"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
if you plan to use 188 tasks.
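
A quick way to catch this kind of mismatch before submitting is to multiply the entries of the CPU string and compare the product with the number of MPI tasks requested from the scheduler. The following is only a minimal bash sketch: the helper names `role_product` and `check_roles` are made up for illustration and are not part of Yambo.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Yambo): multiply the entries of a
# *_CPU string and compare the product with the requested MPI task count.

role_product() {
    local product=1 n
    for n in $1; do                 # unquoted on purpose: split on whitespace
        product=$(( product * n ))
    done
    echo "$product"
}

check_roles() {
    local cpus="$1" ntasks="$2" product
    product=$(role_product "$cpus")
    if [ "$product" -eq "$ntasks" ]; then
        echo "OK: \"$cpus\" gives $product tasks"
    else
        echo "MISMATCH: \"$cpus\" gives $product tasks, but $ntasks were requested"
        return 1
    fi
}

# The two layouts suggested above match their task counts.
check_roles "1 1 1 32 9" 288
check_roles "1 1 1 47 4" 188
# A 288-task layout with only 144 tasks requested is flagged.
check_roles "1 1 1 32 9" 144 || true
```

Inside a Slurm job script the second argument could be taken from `$SLURM_NTASKS`, assuming the job was submitted with an explicit `--ntasks`.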

Finally, I'm not sure you gain much by using hyperthreading.
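
To see whether a given layout ends up relying on hyperthreading, one can compare the total number of threads (MPI tasks times OpenMP threads per task) with the number of physical cores. This is a rough sketch only: the function name `threads_vs_cores` is hypothetical, and the 48 cores per node used in the example calls is an assumed placeholder, since the actual core count is not stated in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: threads_vs_cores NODES CORES_PER_NODE NTASKS THREADS_PER_TASK
# Reports whether the requested threads fit on the physical cores.
threads_vs_cores() {
    local nodes="$1" cores_per_node="$2" ntasks="$3" threads="$4"
    local total_threads=$(( ntasks * threads ))
    local total_cores=$(( nodes * cores_per_node ))
    if [ "$total_threads" -gt "$total_cores" ]; then
        # More threads than physical cores: some threads must share a core.
        echo "shared: $total_threads threads on $total_cores physical cores"
    else
        echo "fits: $total_threads threads on $total_cores physical cores"
    fi
}

# 48 cores per node is an ASSUMED value, replace it with the real one.
threads_vs_cores 3 48 144 2
threads_vs_cores 3 48 144 1
```

With these assumed numbers the first call reports 288 threads on 144 cores and the second reports 144 threads on 144 cores.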

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Nicola Spallanzani
Posts: 65
Joined: Thu Nov 21, 2019 10:15 am

Re: Parallelization error when running on multiple nodes

Post by Nicola Spallanzani » Wed Jul 12, 2023 11:23 am

Dear Stefan,
As additional information, the job script contains these two lines:

Code: Select all

#SBATCH --cpus-per-task=2

export OMP_NUM_THREADS=6
These have to be set to the same value. To make this automatic, you can do the following:

Code: Select all

#SBATCH --cpus-per-task=2

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
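
One caveat: Slurm only defines `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is actually specified, so the variable is empty if that line is ever removed or the script is run outside a job. A slightly more defensive variant, as a sketch, falls back to a single thread in that case (the fallback value of 1 is my choice here, not a Yambo or Slurm requirement):

```shell
#!/usr/bin/env bash
#SBATCH --cpus-per-task=2

# Use the Slurm value when it is set; otherwise fall back to 1 thread
# (the fallback of 1 is an assumption made for this example).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Printing the value into the job output makes it easy to confirm afterwards which thread count was really used.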
Best,
Nicola
Nicola Spallanzani, PhD
S3 Centre, Istituto Nanoscienze CNR and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu
