MPI calculation stops and does not continue

Various technical topics, such as parallelism and efficiency, netCDF problems, and the Yambo code structure itself, are posted here.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan, Nicola Spallanzani

peiogargor
Posts: 7
Joined: Tue Feb 21, 2023 7:42 pm

MPI calculation stops and does not continue

Post by peiogargor » Tue May 30, 2023 8:38 am

Dear all,

I am experiencing a technical issue while running an MPI calculation on my cluster.

On the one hand, I am attaching the run_configure.sh I used to configure the compilation of the code; I then ran "make core" to install the executables. I am also attaching config.log, ./config/report and ./config/setup. If anything more is needed to check the parallel installation, please let me know. As you can see, the installation finishes and appears to be a parallel one.

On the other hand, with this installation I have been working through the GW parallel tutorial: https://www.yambo-code.eu/wiki/index.ph ... strategies. I ran 3 different calculations: MPI 1 + OMP 1 (which finishes without any problem in 28 min 23 s; I am attaching the corresponding log), MPI 1 + OMP 16 (which also finishes without any problem, in 5 min 40 s; log attached), and the problematic one, MPI 16 + OMP 1. The latter does not fail (I mean there is no error message and the job stays in the queue), but it stalls and does not continue. Moreover, if one compares r-*_1 with any other r-*_X in the generated LOG folder, it can be seen that all cores are seemingly doing the same calculation. I attach the log files for CPUs 1 and 2, as well as the run.sh I have been using to submit the jobs.
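For context, a submission script for the MPI 16 + OMP 1 case typically looks like the sketch below. This is a hypothetical example, not the actual run.sh from this post; the job name, input file name and scheduler options are placeholders:

```
#!/bin/bash
#SBATCH --job-name=yambo_gw        # placeholder job name
#SBATCH --ntasks=16                # 16 MPI tasks
#SBATCH --cpus-per-task=1          # 1 OpenMP thread per task

export OMP_NUM_THREADS=1           # MPI 16 + OMP 1
srun yambo -F yambo.in             # input file name is a placeholder
```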

I do not quite understand what is going on: whether the problem is related to the installation, to the job submission script (which I took almost verbatim from the wiki), or to some aspect of my cluster's architecture that I might have missed.

If anything more is needed to trace back the issue, please ask me.

Thank you very much in advance,

Peio.
Dr. Peio Garcia-Goiricelaya
Postdoctoral Researcher
Materials Physics Center, Donostia-San Sebastian (Basque Country, Spain)
https://cfm.ehu.es/es/team/peio-garcia-goiricelaya/

Daniele Varsano
Posts: 3773
Joined: Tue Mar 17, 2009 2:23 pm

Re: MPI calculation stops and does not continue

Post by Daniele Varsano » Tue May 30, 2023 8:53 am

Dear Peio,

it seems that Yambo is running in parallel; however, the parallelization strategy reported in the log files is not the one you specified in the script, so something went wrong. Moreover, that parallelization does not look optimal.
Can you please post your MPI input and report file?

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/


Re: MPI calculation stops and does not continue

Post by peiogargor » Tue May 30, 2023 9:00 am

Hello Daniele,

I send you attached the input file of the MPI calculation as well as its report file.

Thank you,

Peio.


Re: MPI calculation stops and does not continue

Post by Daniele Varsano » Tue May 30, 2023 5:52 pm

Dear Peio,

I'm not totally sure where the problem arises, but it is possible that some of the variables in that tutorial are obsolete.

Can you try replacing the keywords in your input with these:

Code:

X_and_IO_CPU= ""                 # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= ""               # [PARALLEL] CPUs roles (q,g,k,c,v)
X_and_IO_nCPU_LinAlg_INV=-1      # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)
with the same values you set before.

I would also set the parallel linear algebra to 1, since you have a very small matrix to invert.
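For illustration only (these values are placeholders, not taken from the actual run), a 16-MPI-task job distributed over the roles listed in the comments, with serial linear algebra, could look something like:

```
X_and_IO_CPU= "1.1.4.4.1"        # [PARALLEL] CPUs for each role (1*1*4*4*1 = 16 tasks)
X_and_IO_ROLEs= "q.g.k.c.v"      # [PARALLEL] CPUs roles (q,g,k,c,v)
X_and_IO_nCPU_LinAlg_INV= 1      # [PARALLEL] serial linear algebra (small matrix)
```

The actual split over the roles should be adapted to the system (e.g. number of k-points and bands), as discussed in the parallelization tutorial.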

Best,
Daniele

PS: variables controlling the parallelism are added to the input when it is generated with the parallel verbosity (-V par), e.g.:

Code:

yambo -gw0 p -g n -V par
This way you can be sure you are using the correct variable names.
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

peiogargor
Posts: 7
Joined: Tue Feb 21, 2023 7:42 pm

Re: MPI calculation stops and does not continue

Post by peiogargor » Tue May 30, 2023 7:43 pm

Dear Daniele,

You hit the mark! It seems that the problem was related to the names of the input variables; once they were changed, it worked perfectly, i.e.:

X_CPU -> X_and_IO_CPU
X_ROLEs -> X_and_IO_ROLEs
X_nCPU_LinAlg_INV -> X_and_IO_nCPU_LinAlg_INV
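The rename above can be applied mechanically with sed. A demo on a throwaway file (the filename gw_demo.in and the variable values are placeholders; in practice you would edit your real Yambo input):

```shell
# Create a sample input with the obsolete variable names (demo only).
printf 'X_CPU= "16"\nX_ROLEs= "q.k.c.v"\nX_nCPU_LinAlg_INV= 1\n' > gw_demo.in

# Rewrite the obsolete names to the current ones, in place.
sed -i -e 's/^X_CPU/X_and_IO_CPU/' \
       -e 's/^X_ROLEs/X_and_IO_ROLEs/' \
       -e 's/^X_nCPU_LinAlg_INV/X_and_IO_nCPU_LinAlg_INV/' gw_demo.in

cat gw_demo.in
```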

The value of the parallel linear algebra does not seem to play any big role, as the calculation took the same time for -1, 1 and 16. It lasted 1 min 54 s, so a big difference even compared to the OpenMP calculation.

Thank you very much!

PS: I would update the script in the tutorial to avoid anyone else getting confused by the input variable names.


Re: MPI calculation stops and does not continue

Post by Daniele Varsano » Tue May 30, 2023 8:32 pm

Hola Peio,

Great that it is solved!
Yes, we need to update the tutorial; I will do it as soon as possible. Thanks for spotting it!

Daniele
