blue gene
Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan
-
- Posts: 149
- Joined: Tue Apr 08, 2014 6:05 am
blue gene
Dear Developers,
Is YAMBO 4.0 rev. 100 stable enough to run on some 1000 cores of a Blue Gene machine for production runs?
Up to how many cores is there still no saturation effect for YAMBO 3.4.2?
Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany
- Daniele Varsano
- Posts: 4209
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: blue gene
Dear Martin,
it depends on the runlevel. GW calculations have been run on more than 16k CPUs without reaching saturation.
Bethe-Salpeter calculations are more delicate; we are currently investigating and optimizing their performance.
For 3.4.2 I do not have a clear answer.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
- Davide Sangalli
- Posts: 641
- Joined: Tue May 29, 2012 4:49 pm
- Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
- Contact:
Re: blue gene
Dear Martin,
4.0 is still a testing version. As Daniele pointed out, it depends on the runlevel.
The GW part has been extensively tested in parallel; the BSE, for example, much less.
For 3.4.2, if I remember correctly, the scaling was reasonable up to about 200 cores and very poor above that.
That result was specific to one runlevel and system, but I think it gives you an idea.
Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/
-
- Posts: 149
- Joined: Tue Apr 08, 2014 6:05 am
Re: blue gene
Dear Daniele and Davide,
many thanks.
I will try GW and BSE on my system using yambo_4.0 and will report back.
Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany
-
- Posts: 149
- Joined: Tue Apr 08, 2014 6:05 am
Re: blue gene
Dear Daniele,
when I run yambo_4.0 on Blue Gene using, for instance, this command:
"yambo -F Inputs/02_QP_PPA_pure-mpi-q -J 02_QP_PPA_pure-mpi-q"
I obtain the errors "UNKNOWN OPTION F" and "UNKNOWN OPTION J".
How should I run the yambo_4.0 executable properly on Blue Gene machines?
Bests
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany
-
- Posts: 214
- Joined: Fri Jan 31, 2014 11:13 am
Re: blue gene
Hi Martin,
it seems that the command-line options are not recognized.
What is the actual command you use to run yambo and to pass it the options?
For instance, on BGQ I use either of the following two forms:
Code:
runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 : ./yambo -F yambo.in -J MYDIR
or
runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 --exe ./yambo --args -F yambo.in --args -J MYDIR
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it
-
- Posts: 149
- Joined: Tue Apr 08, 2014 6:05 am
Re: blue gene
Dear Andrea,
many thanks, it works for me too.
Adding "--args" solved the problem.
However, on many cores (1024-2048 physical cores), is the automatic task distribution not possible,
so that one can avoid setting the CPU numbers by hand in the input file?
Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany
- Davide Sangalli
- Posts: 641
- Joined: Tue May 29, 2012 4:49 pm
- Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
- Contact:
Re: blue gene
Dear Martin,
you can try the automatic task distribution on many cores, but I fear it will not work.
Besides that, it is certainly not the most efficient way of running yambo on 1000 cores or more.
I would say the best approach is MPI+OpenMP parallelization, setting the MPI strategy explicitly in the input file.
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/
-
- Posts: 214
- Joined: Fri Jan 31, 2014 11:13 am
Re: blue gene
Dear Martin,
I pretty much agree with Davide's suggestion. When running on many cores it is best to set the parallelism by hand.
Moreover, on BGQ a hybrid MPI+OpenMP scheme is almost mandatory for yambo because of memory requirements.
Let me explain in more detail:
on BGQ you have 16 GB of RAM per node. First you need to estimate how much memory your system will take per MPI task,
say about 4 GB (this of course depends on the parallelization, but we only need a rough estimate).
This means you can use up to 4 MPI tasks per node. In order not to waste performance, you should then run a number of OpenMP threads per MPI task so as to exploit the cores you are not otherwise using. For architectural reasons, BGQ can sustain up to 64 threads per node, so you can use up to 16 threads per MPI task.
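The node-layout arithmetic above can be sketched in a few lines of shell (the 4 GB per-task figure is just the rough estimate assumed in the text, not a measured value):

```shell
# Back-of-envelope BGQ layout, assuming ~4 GB of memory per MPI task
NODE_RAM_GB=16          # RAM per BGQ node
TASK_MEM_GB=4           # rough per-task memory estimate (assumption)
HW_THREADS_PER_NODE=64  # hardware threads a BGQ node can sustain

TASKS_PER_NODE=$((NODE_RAM_GB / TASK_MEM_GB))               # memory-limited task count
THREADS_PER_TASK=$((HW_THREADS_PER_NODE / TASKS_PER_NODE))  # threads to fill the node
echo "${TASKS_PER_NODE} MPI tasks/node, ${THREADS_PER_TASK} OpenMP threads/task"
```

With a different memory estimate per task, the same two divisions give the corresponding layout.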
Coming to yambo: I would set nthreads=8 as a default; you can then increase it to 16 in the self-energy part (SE_Threads), while I would lower it to 8 or 4 in the calculation of the response function (X_Threads) because of its worse OpenMP scaling.
Regarding MPI, you have the flexibility to distribute your MPI tasks (4 per node * number_of_nodes) over a number of levels.
Concerning X: the q and k levels distribute less memory, q does not communicate at all, and both can lead to some degree of load imbalance; I would therefore use the c and v parallelism as much as possible, resorting to q and k when needed (after all, these levels are not that bad either).
Concerning sigma: I would avoid q parallelism if possible (load imbalance), use qp as much as possible, and then add b. If you have memory issues, b distributes memory best (as c and v do for X).
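Putting these hints together, a parallelization block in the yambo input for 1024 MPI tasks might look like the following. This is only a sketch: the variable names follow the yambo 4.x input conventions, and the factorizations (which must multiply to the total task count) have to be adapted to your actual q/k-point and band counts.

```
X_all_q_CPU   = "1 1 32 32"   # tasks per role (1*1*32*32 = 1024): favour c and v
X_all_q_ROLEs = "q k c v"     # resort to q and k only when needed
X_Threads     = 8             # OpenMP threads for the response function
SE_CPU        = "1 64 16"     # 1*64*16 = 1024: avoid q, push qp, then b
SE_ROLEs      = "q qp b"
SE_Threads    = 16            # the self-energy part scales better with OpenMP
```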
Hope this helps,
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it