Dear Haseeb,
the kernel building seems to be rather well balanced: the expected time to complete the task on each CPU is similar (around 3h30m).
The problem is that the first 2 CPUs start the calculation only after around 3h, which is unexpected, and honestly I have no idea what is going on. From the log and the report everything seems to be fine.
Can you try to repeat the calculation using a different number of CPUs? This could help to spot the problem.
As suggested by Claudio, it is good practice to set the parallelization strategy explicitly, as the default one may not be optimal.
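For example, for the BSE kernel with 8 MPI tasks you could add something along these lines to the input file (just a sketch: the exact variable names and values depend on the runlevel and on your yambo version, so check the ones generated in your own input):
Code:
BS_CPU= "1 4 2"      # [PARALLEL] CPUs for each role; the product must match the number of MPI tasks
BS_ROLEs= "k eh t"   # [PARALLEL] CPUs roles (k, eh, t) for the BSE kernel
The same logic applies to the other runlevels through the corresponding *_CPU / *_ROLEs variables.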
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
Re: parallel problem with a full frequency approach
Dear Yambo developers,
I was having memory issues in solving the BSE, and I could not go beyond 40 BSEBands on my single 125 GB node. So I decided to add another node to have more memory. But I have found that running the calculation on more than one node does not help either, since I need to run at least 2 MPI processes to spread the calculation over multiple nodes.
Q1: If a single MPI task takes 100 GB of RAM, will 2 MPI tasks take 200 GB of RAM in Yambo? If so, what is the benefit of adding more nodes in terms of RAM, if I can do the same calculation on a single core and a single node?
Here is my SLURM script,
Thanks,
Code:
#!/bin/bash
#
#SBATCH --job-name=H1
#SBATCH --partition=debug
#SBATCH --output=res_bse.txt
#SBATCH -t 999:00:00
#SBATCH --tasks-per-node=1
#SBATCH --nodes=2
#SBATCH --nodelist=compute-0-2,compute-0-3
#SBATCH --mem-per-cpu=122000M
#SBATCH --cpus-per-task=1 ### Number of threads per task (OMP threads)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
nodes=2
tasks_per_node=1
nthreads=1
ncpu=`echo $nodes $tasks_per_node | awk '{print $1*$2}'`
bindir=/export/installs/Yambo_intel/Yambo4.5.1_OpenMP/yambo-4.5.1/bin
echo "Running on $ncpu MPI, $nthreads OpenMP threads"
srun hostname
source /export/installs/intelcc/parallel_studio_xe_2019.1.053/bin/psxevars.sh
which mpirun
# leftover MPI debugging line (can be removed once mpiexec.hydra is confirmed to launch correctly)
export I_MPI_HYDRA_TOPOLIB=ipl; gdb --args mpiexec.hydra --verbose -n 1 /bin/ls
mpirun -np $ncpu $bindir/yambo -F bse.in -C bse
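(As a side note, the total MPI task count in the script above could also be taken directly from the environment SLURM sets up, instead of recomputing it with awk; a small sketch, assuming the usual SLURM variables are exported:)
Code:
# SLURM_NTASKS = nodes * tasks-per-node, as allocated by SLURM
ncpu=${SLURM_NTASKS}
mpirun -np $ncpu $bindir/yambo -F bse.in -C bse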
Haseeb Ahmad
MS - Physics,
LUMS - Pakistan
Re: parallel problem with a full frequency approach
Dear Haseeb,
from your last post it is not clear whether your problem is in building the BSE matrix or in solving (diagonalizing) it.
If the question is about the diagonalization, parallelism enters only if you linked yambo with the ScaLAPACK libraries; in any case, I'm not sure whether at this stage the memory is distributed among the CPUs.
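Just to give a rough idea of the numbers involved (purely illustrative figures, not your actual system): the resonant BSE block has dimension N = Nk x Nc x Nv, i.e. k-points times the conduction and valence bands selected by BSEBands, and storing it in double-precision complex takes about 16*N^2 bytes, so N = 80000 already corresponds to roughly 100 GB. The size of the matrix, and not only the number of MPI tasks, is therefore what determines whether it fits in the memory of a node.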
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/