blue gene

Here you can find problems arising when using old releases of Yambo (< 5.0): issues such as parallelization strategy, performance, and other technical aspects.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

blue gene

Post by martinspenke » Mon Feb 29, 2016 3:23 pm

Dear Developers,

Is YAMBO 4.0 rev. 100 stable enough to run on some 1000 cores of a Blue Gene machine for a production run?
Up to how many cores is there still no saturation effect for YAMBO 3.4.2?
Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

Daniele Varsano
Posts: 4209
Joined: Tue Mar 17, 2009 2:23 pm
Contact:

Re: blue gene

Post by Daniele Varsano » Mon Feb 29, 2016 3:27 pm

Dear Martin,
it depends on the runlevel. GW calculations have been run on more than 16k CPUs without reaching saturation.
Bethe-Salpeter calculations are more delicate; we are currently investigating and optimizing their performance.

For 3.4.2 I do not have a clear answer.

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Davide Sangalli
Posts: 641
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: blue gene

Post by Davide Sangalli » Mon Feb 29, 2016 5:15 pm

Dear Martin,
4.0 is still a testing version. As Daniele pointed out, it depends on the runlevel.
The GW part has been extensively tested in parallel; the BSE, for example, much less so.

For 3.4.2, if I remember correctly, the scaling was reasonable up to about 200 cores
and very poor above that. The result was specific to one runlevel and system dependent,
but I think it gives you an idea.

Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

Re: blue gene

Post by martinspenke » Mon Feb 29, 2016 6:28 pm

Dear Daniele and Davide,

many thanks.
I will try GW and BSE on my system using yambo_4.0 and will report back.

Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

Re: blue gene

Post by martinspenke » Fri Mar 04, 2016 7:30 am

Dear Daniele,

when I run yambo_4.0 on Blue Gene using, for instance, this command:
"yambo -F Inputs/02_QP_PPA_pure-mpi-q -J 02_QP_PPA_pure-mpi-q"

I obtain the errors "UNKNOWN OPTION F" and "UNKNOWN OPTION J".

How should I run the yambo_4.0 executable properly on Blue Gene machines?

Bests
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

andrea.ferretti
Posts: 214
Joined: Fri Jan 31, 2014 11:13 am

Re: blue gene

Post by andrea.ferretti » Fri Mar 04, 2016 9:34 am

Hi Martin,

it seems that the command line options are not recognized.
What is the actual command you use to run yambo and to provide the options?

For instance, on BGQ, I use one of the following two forms:

Code:

runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 : ./yambo -F yambo.in -J MYDIR

or

Code:

runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 --exe ./yambo --args -F yambo.in --args -J MYDIR
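
For completeness, on BGQ these commands usually go inside a batch script. A minimal LoadLeveler sketch (the job name, bg_size, wall-time and node count below are hypothetical placeholders, to be adapted to your site):

Code:

#!/bin/bash
# Hypothetical LoadLeveler script for a BGQ machine; adapt the keywords to your site.
# bg_size is the number of BGQ nodes (16 cores each).
# @ job_name         = yambo_gw
# @ job_type         = bluegene
# @ bg_size          = 128
# @ wall_clock_limit = 01:00:00
# @ queue

# 1024 MPI tasks = 128 nodes x 8 ranks per node, with 4 OpenMP threads each.
runjob --np 1024 --ranks-per-node 8 --envs OMP_NUM_THREADS=4 : ./yambo -F yambo.in -J MYDIR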
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it

martinspenke
Posts: 149
Joined: Tue Apr 08, 2014 6:05 am

Re: blue gene

Post by martinspenke » Mon Mar 07, 2016 6:31 am

Dear Andrea,

many thanks, it works for me too.
Adding "--args" solved the problem.

However, on many cores (1024-2048 physical cores), is the automatic task distribution not possible,
so that one could avoid setting the CPU numbers by hand in the input file?

Best wishes
Martin
Martin Spenke, PhD Student
Theoretisch-Physikalisches Institut
Universität Hamburg, Germany

Davide Sangalli
Posts: 641
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: blue gene

Post by Davide Sangalli » Mon Mar 07, 2016 8:51 am

Dear Martin,
you can try the automatic task distribution on many cores, but I fear it will not work.

Besides that, it is certainly not the most efficient way of running yambo on 1000 cores or more.
I would say the best approach is MPI+OpenMP parallelization, setting the MPI strategy explicitly in the input file.
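
As a starting point, in the 4.x series you can expose the parallel variables in the generated input by raising the verbosity. A sketch, assuming a GW plasmon-pole run (the -p p -g n runlevel flags are just an example):

Code:

# Generate the input with the parallel environment section visible
yambo -p p -g n -V par -F yambo.in

The -V par option should add the *_ROLEs, *_CPU and *_Threads fields to yambo.in, which you can then fill in by hand.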

D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

andrea.ferretti
Posts: 214
Joined: Fri Jan 31, 2014 11:13 am

Re: blue gene

Post by andrea.ferretti » Mon Mar 07, 2016 10:09 am

Dear Martin,

I pretty much agree with Davide's suggestion. When running on many cores it is best to set the parallelism by hand.
Moreover, on BGQ a hybrid MPI+OpenMP scheme is almost mandatory for yambo because of its memory requirements.

I'll try to explain myself better:

on BGQ you have 16 GB of RAM per node. First you need to estimate how much memory your system will take per MPI task,
say about 4 GB (of course this depends on the parallelization, but we need a rough estimate).
This means you can use at most 4 MPI tasks per node. In order not to waste performance, you then run a number of OpenMP threads per MPI task to exploit the cores that would otherwise sit idle. For architectural reasons, BGQ can sustain up to 64 threads per node, so you can use up to 16 threads per MPI task.
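
In terms of the runjob line shown earlier in the thread, this estimate translates into something like the following (128 nodes is just an example):

Code:

# 4 MPI tasks per node x 16 OpenMP threads = 64 hardware threads per BGQ node.
# --np must equal ranks-per-node times the number of nodes (4 x 128 = 512 here).
runjob --np 512 --ranks-per-node 4 --envs OMP_NUM_THREADS=16 : ./yambo -F yambo.in -J MYDIR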

Coming to yambo: I would set nthreads=8 as a default; you can then increase it to 16 in the self-energy part (SE_threads), while I would keep it at 8 or lower it to 4 for the calculation of the response function (X_threads), which shows worse OpenMP scaling.
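
In the input file this could look like the following sketch (values as suggested above; the exact variable names may differ slightly between 4.x versions):

Code:

SE_Threads= 16       # OpenMP threads for the self-energy part
X_Threads= 8         # OpenMP threads for the response function (worse scaling: use 8 or 4)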

Regarding the MPI, you have the flexibility to distribute your MPI tasks (4 per node * number_of_nodes) over a number of levels.

Concerning X: the q and k levels distribute less memory, q does not communicate at all, and both can lead to some degree of load imbalance; therefore I would use c and v parallelism as much as possible, resorting to q and k when needed (after all, these levels of parallelism are not that bad either).

Concerning sigma: I would avoid q parallelism if possible (load imbalance), use qp as much as possible, and then start with b. If you have memory issues, b distributes memory best (as c and v do for X).
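
Putting the two together, for 512 MPI tasks a possible assignment along these lines would be the following sketch (the product of each *_CPU line must equal the total number of MPI tasks; the factors here are hypothetical and should be tuned to your system):

Code:

# Response function: favour c and v, leave q and k undistributed.
X_all_q_ROLEs= "q k c v"
X_all_q_CPU= "1 1 16 32"
# Self-energy: favour qp, then b; avoid q.
SE_ROLEs= "q qp b"
SE_CPU= "1 64 8"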

hope this helps
Andrea
Andrea Ferretti, PhD
CNR-NANO-S3 and MaX Centre
via Campi 213/A, 41125, Modena, Italy
Tel: +39 059 2055322; Skype: andrea_ferretti
URL: http://www.nano.cnr.it
