
blue gene

Posted: Mon Feb 29, 2016 3:23 pm
by martinspenke
Dear Developers,

Is YAMBO 4.0 rev. 100 stable enough for production runs on some 1000 cores of a Blue Gene machine?
Up to how many cores is there still no saturation effect for YAMBO 3.4.2?
Best wishes
Martin

Re: blue gene

Posted: Mon Feb 29, 2016 3:27 pm
by Daniele Varsano
Dear Martin,
it depends on the runlevel. GW calculations have been run on more than 16K CPUs without reaching saturation.
Bethe-Salpeter calculations are more delicate; we are currently investigating and optimizing their performance.

For 3.4.2 I do not have a clear answer.

Best,
Daniele

Re: blue gene

Posted: Mon Feb 29, 2016 5:15 pm
by Davide Sangalli
Dear Martin,
4.0 is still a testing version. As Daniele pointed out, it depends on the runlevel.
The GW part has been extensively tested in parallel; the BSE part, for example, much less so.

For 3.4.2, if I remember correctly, the scaling was reasonable up to about 200 cores
and very poor above that. That result was specific to one runlevel and system dependent,
but I think it gives you an idea.

Best,
D.

Re: blue gene

Posted: Mon Feb 29, 2016 6:28 pm
by martinspenke
Dear Daniele and Davide,

many thanks.
I will try GW and BSE on my system using yambo_4.0, and will come back.

Best wishes
Martin

Re: blue gene

Posted: Fri Mar 04, 2016 7:30 am
by martinspenke
Dear Daniele,

when I run yambo_4.0 on Blue Gene using, for instance, this command:
"yambo -F Inputs/02_QP_PPA_pure-mpi-q -J 02_QP_PPA_pure-mpi-q"

I obtain the errors "UNKNOWN OPTION F" and "UNKNOWN OPTION J".

How should I run the yambo_4.0 executable properly on Blue Gene machines?

Bests
Martin

Re: blue gene

Posted: Fri Mar 04, 2016 9:34 am
by andrea.ferretti
Hi Martin,

it seems that the command line options are not recognized.
What is the actual command you use to run yambo and to provide the options?

For instance, on BGQ I use one of the following two forms:

Code:

runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 : ./yambo -F yambo.in -J MYDIR

or

Code:

runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 --exe ./yambo --args -F yambo.in --args -J MYDIR
Andrea

Re: blue gene

Posted: Mon Mar 07, 2016 6:31 am
by martinspenke
Dear Andrea,

many thanks, it works for me too.
Adding "--args" solved the problem.

However, on many cores (1024-2048 physical cores), is automatic task distribution not possible,
so that one could avoid setting the CPU numbers by hand in the input file?

Best wishes
Martin

Re: blue gene

Posted: Mon Mar 07, 2016 8:51 am
by Davide Sangalli
Dear Martin,
you can try the automatic task distribution on many cores, but I fear it will not work.

Besides that, it is certainly not the most efficient way of running yambo on 1000 cores or more.
I would say the best approach is MPI+OpenMP parallelization, setting the MPI strategy in the input file.

D.

Re: blue gene

Posted: Mon Mar 07, 2016 10:09 am
by andrea.ferretti
Dear Martin,

I agree pretty much with Davide's suggestion. When running on many cores it is best to set the parallelism by hand.
Moreover, on BGQ a hybrid MPI+OpenMP scheme is almost mandatory for yambo because of its memory requirements.

I'll try to explain myself better:

on BGQ you have 16 GB of RAM per node. First you need to estimate how much memory your system will take per MPI task,
say about 4 GB (of course this depends on the parallelization, but we need a rough estimate).
That means you can use up to 4 MPI tasks per node. In order not to waste performance, you should then run a number of OpenMP threads per MPI task to exploit the cores you would otherwise leave idle. For architectural reasons, BGQ can sustain up to 64 hardware threads per node, so you can use up to 16 threads per MPI task.
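As a back-of-the-envelope sketch of that arithmetic (the 4 GB per-task figure is only an assumed example; substitute your own memory estimate):

```shell
# Worked example of the node-budget arithmetic above.
# TASK_MEM_GB=4 is an assumed estimate; replace it with your own.
NODE_RAM_GB=16           # RAM per BGQ node
TASK_MEM_GB=4            # rough memory needed per MPI task
HW_THREADS_PER_NODE=64   # BGQ sustains up to 64 hardware threads per node

TASKS_PER_NODE=$(( NODE_RAM_GB / TASK_MEM_GB ))               # 16/4 = 4
THREADS_PER_TASK=$(( HW_THREADS_PER_NODE / TASKS_PER_NODE ))  # 64/4 = 16

echo "--ranks-per-node ${TASKS_PER_NODE} with OMP_NUM_THREADS=${THREADS_PER_TASK}"
```

With these (assumed) numbers you would pass --ranks-per-node 4 and OMP_NUM_THREADS=16 to runjob instead of the 8/4 split shown earlier in the thread.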

Coming to yambo: I would set nthreads=8 as a default. You can then increase it to 16 in the self-energy part (SE_threads), while I would lower it to 8 or 4 in the calculation of the response function (X_threads), whose OpenMP scaling is worse.
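As a sketch, those thread settings might look like this in the yambo input file (the variable names below are what I'd expect in yambo 4.x; please check them against the input file your own version generates):

```
# Hypothetical fragment of a yambo 4.x input file (verify variable
# names against your build): OpenMP threads per runlevel.
SE_Threads= 16     # self-energy part: scales better, can take more threads
X_Threads=  8      # response function: worse OpenMP scaling, keep it lower
```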

Regarding the MPI, you have the flexibility to distribute your MPI tasks (4 per node * number_of_nodes) over a number of parallelism levels.

Concerning X: q and k distribute less memory, q does not communicate at all, and both can lead to some degree of load unbalance. Therefore I would use c and v parallelism as much as possible, resorting to q and k when needed (after all, these levels of parallelism are not that bad either).

Concerning sigma: I would avoid q parallelism if possible (load unbalance), use qp as much as possible, and then start with b. If you have memory issues, b distributes memory best (as c and v do for X).
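Putting the MPI advice together, a hypothetical input sketch for 256 MPI tasks (4 per node on 64 nodes; the variable names and the task split are illustrative only, so verify both against the input file your yambo version produces):

```
# Illustrative example for 256 MPI tasks in total; the product of the
# numbers on each *_CPU line must equal the number of MPI tasks.
X_all_q_ROLEs= "q k c v"     # parallelism levels for the response function
X_all_q_CPU=   "1 1 16 16"   # favour c and v, as suggested above
SE_ROLEs= "q qp b"           # parallelism levels for the self-energy
SE_CPU=   "1 64 4"           # avoid q, use qp as much as possible, then b
```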

hope this helps
Andrea