blue gene

Here you can find problems arising when using old releases of Yambo (< 5.0): issues such as parallelization strategy, performance, and other technical aspects.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

blue gene

Post by martinspenke » Mon Feb 29, 2016 3:23 pm

Dear Developers,

Is YAMBO 4.0 rev. 100 stable enough to run on some 1000 cores of a Blue Gene machine for a production run?
Up to how many cores is there still no saturation effect for YAMBO 3.4.2?
Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

Daniele Varsano
Posts: 4209
Joined: Tue Mar 17, 2009 2:23 pm
Contact:

Re: blue gene

Post by Daniele Varsano » Mon Feb 29, 2016 3:27 pm

Dear Martin,
it depends on the runlevel. GW calculations have been run on more than 16k CPUs without reaching saturation.
Bethe-Salpeter calculations are more delicate; we are currently investigating and optimizing their performance.

For 3.4.2 I do not have a clear answer.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Davide Sangalli
Posts: 641
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: blue gene

Post by Davide Sangalli » Mon Feb 29, 2016 5:15 pm

Dear Martin,
4.0 is still a testing version. As Daniele pointed out, it depends on the runlevel.
The GW part has been extensively tested in parallel; the BSE, for example, much less so.

For 3.4.2, if I remember correctly, the scaling was reasonable up to about 200 cores
and very poor above that. The result was specific to one runlevel and system dependent,
but I think it gives you an idea.

Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

Re: blue gene

Post by martinspenke » Mon Feb 29, 2016 6:28 pm

Dear Daniele and Davide,

many thanks.
I will try GW and BSE on my system using yambo_4.0 and will report back.

Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

Re: blue gene

Post by martinspenke » Fri Mar 04, 2016 7:30 am

Dear Daniele,

when I run yambo_4.0 on Blue Gene using, for instance, this command:
"yambo -F Inputs/02_QP_PPA_pure-mpi-q -J 02_QP_PPA_pure-mpi-q"

I obtain the errors "UNKNOWN OPTION F" and "UNKNOWN OPTION J".

How should I run the yambo_4.0 executable properly on Blue Gene machines?

Bests
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

andrea.ferretti
Posts: 214
Joined: Fri Jan 31, 2014 11:13 am

Re: blue gene

Post by andrea.ferretti » Fri Mar 04, 2016 9:34 am

Hi Martin,

it seems that the command line options are not recognized.
What is the actual command you use to run yambo and to provide the options?

For instance, on BGQ, I use one of the following two forms:

Code:

runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 : ./yambo -F yambo.in -J MYDIR

or

Code:

runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 --exe ./yambo --args -F yambo.in --args -J MYDIR
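
For completeness, on BGQ these commands usually go inside a batch script. A minimal LoadLeveler sketch (the job name, bg_size, wall-time and node count below are hypothetical placeholders, to be adapted to your site):

Code:

#!/bin/bash
# Hypothetical LoadLeveler script for a BGQ machine; adapt the keywords to your site.
# bg_size is the number of BGQ nodes (16 cores each).
# @ job_name         = yambo_gw
# @ job_type         = bluegene
# @ bg_size          = 128
# @ wall_clock_limit = 01:00:00
# @ queue

# 1024 MPI tasks = 128 nodes x 8 ranks per node, with 4 OpenMP threads each.
runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 : ./yambo -F yambo.in -J MYDIR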
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

Re: blue gene

Post by martinspenke » Mon Mar 07, 2016 6:31 am

Dear Andrea,

many thanks, it works for me too.
Adding "--args" solved the problem.

However, on many cores (1024-2048 physical cores), is the automatic task distribution not possible,
so that one could avoid setting the CPU numbers by hand in the input file?

Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

Davide Sangalli
Posts: 641
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: blue gene

Post by Davide Sangalli » Mon Mar 07, 2016 8:51 am

Dear Martin,
you can try the automatic task distribution on many cores, but I fear it will not work.

Besides that, it is certainly not the most efficient way of running yambo on 1000 cores or more.
I would say the best approach is MPI+OpenMP parallelization, setting the MPI strategy explicitly in the input file.
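
As a starting point, in the 4.x series you can expose the parallel variables in the generated input by raising the verbosity. A sketch, assuming a GW plasmon-pole run (the -p p -g n runlevel flags are just an example):

Code:

# Generate the input with the parallel environment section visible
yambo -p p -g n -V par -F yambo.in

The -V par option should add the *_ROLEs, *_CPU and *_Threads fields to yambo.in, which you can then fill in by hand.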

D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

andrea.ferretti
Posts: 214
Joined: Fri Jan 31, 2014 11:13 am

Re: blue gene

Post by andrea.ferretti » Mon Mar 07, 2016 10:09 am

Dear Martin,

I pretty much agree with Davide's suggestion. When running on many cores it is best to set the parallelism by hand.
Moreover, on BGQ a hybrid MPI+OpenMP scheme is almost mandatory for yambo because of its memory requirements.

I'll try to explain myself better:

on BGQ you have 16 GB of RAM per node. First you need to estimate how much memory your system will take per MPI task,
say about 4 GB (of course this depends on the parallelization, but we need a rough estimate).
This means you can use at most 4 MPI tasks per node. In order not to waste performance, you then run a number of OpenMP threads per MPI task to exploit the cores that would otherwise sit idle. For architectural reasons, BGQ can sustain up to 64 threads per node, so you can use up to 16 threads per MPI task.
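
In terms of the runjob line shown earlier in the thread, this estimate translates into something like the following (128 nodes is just an example):

Code:

# 4 MPI tasks per node x 16 OpenMP threads = 64 hardware threads per BGQ node.
# --np must equal ranks-per-node times the number of nodes (4 x 128 = 512 here).
runjob --np 512 --ranks-per-node 4 --envs OMP_NUM_THREADS=16 : ./yambo -F yambo.in -J MYDIR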

Coming to yambo: I would set nthreads=8 as a default; you can then increase it to 16 in the self-energy part (SE_threads), while I would keep it at 8 or lower it to 4 for the calculation of the response function (X_threads), which shows worse OpenMP scaling.
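
In the input file this could look like the following sketch (values as suggested above; the exact variable names may differ slightly between 4.x versions):

Code:

SE_Threads= 16       # OpenMP threads for the self-energy part
X_Threads= 8         # OpenMP threads for the response function (worse scaling: use 8 or 4)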

Regarding the MPI, you have the flexibility to distribute your MPI tasks (4 per node * number_of_nodes) over a number of levels.

Concerning X: the q and k levels distribute less memory, q does not communicate at all, and both can lead to some degree of load imbalance; therefore I would use c and v parallelism as much as possible, resorting to q and k when needed (after all, these levels of parallelism are not that bad either).

Concerning sigma: I would avoid q parallelism if possible (load imbalance), use qp as much as possible, and then start with b. If you have memory issues, b distributes memory best (as c and v do for X).
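
Putting the two together, for 512 MPI tasks a possible assignment along these lines would be the following sketch (the product of each *_CPU line must equal the total number of MPI tasks; the factors here are hypothetical and should be tuned to your system):

Code:

# Response function: favour c and v, leave q and k undistributed.
X_all_q_ROLEs= "q k c v"
X_all_q_CPU= "1 1 16 32"
# Self-energy: favour qp, then b; avoid q.
SE_ROLEs= "q qp b"
SE_CPU= "1 64 8"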

hope this helps
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it
