MPI calculation stops and does not continue
Posted: Tue May 30, 2023 8:38 am
Dear all,
I am experiencing a technical issue while running an MPI calculation on my cluster.
First, I am attaching the run_configure.sh I used to configure the compilation of the code. I then ran "make core" to build the executables. I am also attaching config.log, ./config/report and ./config/setup. If anything else is needed to check the parallel installation, please ask. As you can see, the build completes and appears to be a parallel one.
With this installation, I have been working through the GW parallel tutorial: https://www.yambo-code.eu/wiki/index.ph ... strategies. I ran 3 different calculations: MPI 1 + OMP 1 (which finishes without any problem in 28 min 23 s; log attached), MPI 1 + OMP 16 (which also finishes without any problem, in 5 min 40 s; log attached), and the problematic one, MPI 16 + OMP 1. In the last case the calculation does not fail (there is no error message and the job stays in the queue), but it stalls and makes no further progress. Moreover, if one compares r-*_1 with any other r-*_X in the generated LOG folder, all cores appear to be doing the same calculation. I am attaching the log files for CPUs 1 and 2, as well as the run.sh I have been using to submit the jobs.
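In case it helps to reproduce the comparison of the per-process files, this is roughly how it can be done. It is only a minimal sketch: the function name `differing_lines` and the file names in the usage note are hypothetical placeholders, not the actual names produced by the run.

```python
import difflib
from pathlib import Path


def differing_lines(path_a, path_b):
    """Return the lines that appear in only one of the two text files.

    Lines only in the first file are prefixed with "- ",
    lines only in the second file with "+ ".
    """
    lines_a = Path(path_a).read_text().splitlines()
    lines_b = Path(path_b).read_text().splitlines()
    # ndiff marks unchanged lines with two spaces and hint lines with "? ";
    # only the "- " and "+ " entries are kept here.
    return [
        line
        for line in difflib.ndiff(lines_a, lines_b)
        if line.startswith(("- ", "+ "))
    ]
```

For example, `differing_lines("LOG/file_for_cpu_1", "LOG/file_for_cpu_2")` (placeholder paths) returns an empty list when the two files are identical. Timestamps will usually differ, so a short list of differences does not by itself mean that the processes are doing different work.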
I do not really understand what is going on: whether the problem is related to the installation, to the job submission script (which I took almost unchanged from the wiki), or to some aspect of my cluster's architecture that I may have missed.
If anything else is needed to track down the issue, please ask.
Thank you very much in advance,
Peio.