an error message at the GW level
Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan, Nicola Spallanzani
-
- Posts: 19
- Joined: Sun Dec 03, 2017 10:24 am
- Location: Algeria
an error message at the GW level
I ran a GW calculation (yambo-4.1.2) for a MoSe2 monolayer on a cluster using 2 nodes x 2 tasks (4 CPUs in total). This calculation works well but takes a lot of time. Since the number of nodes on our cluster is limited, I tried to perform the same calculation with 16 CPUs (4 nodes x 4 tasks). Two days after launching this job, I encountered the following error message: mpirun noticed that process rank 0 with PID 25006 on node ibnbadis15 exited on signal 9 (Killed).
Is there any possibility to overcome this?
my best regards
foudil zaabar
university of bejaia
Algeria
- Daniele Varsano
- Posts: 3816
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: an error message at the GW level
Dear foudil zaabar,
this message does not come from yambo but from your queue system. It is possible that your job reached the maximum wall time allowed; you can check whether this is the case by inspecting the standard error or standard output files.
In any case, if you post your input, report, and log files, we can have a look to see whether there is a problem and/or a possibility to optimize the calculation.
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 19
- Joined: Sun Dec 03, 2017 10:24 am
- Location: Algeria
Re: an error message at the GW level
Dear Daniele,
Thank you for your help.
Here you find all the files you asked for; I put the SLURM script in the same file as the input (yambo.in).
Thank you in advance.
zaabar foudil
university of bejaia
Algeria
- Daniele Varsano
- Posts: 3816
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: an error message at the GW level
Dear Zaabar,
It is possible that your job died for memory reasons: your run is using more than 5 GB, and the code seems to have died after 18 seconds.
You can check the allocated memory in the log file:
Code: Select all
<18s> P0008: [M 5.096 Gb] Alloc wf_disk ( 0.025)
Possible strategies to solve the issue are:
1) #SBATCH --mem=30000
Is this the maximum memory available on the node? Make sure of that and set it to the maximum.
2) Reduce FFTGvecs in your calculation; you can always do that, but not by too much. Add the variable to your input, e.g.:
Code: Select all
FFTGvecs=70000
You can probably reduce it even further, but remember to check the accuracy of the final results with respect to the value you set.
3) If this still does not solve the problem, you can consider splitting your calculation into multiple lighter runs, e.g.:
Run1:
Code: Select all
%QPkrange # [GW] QP generalized Kpoint/Band indices
1|127 | 43|46|
%
Run2:
Code: Select all
%QPkrange # [GW] QP generalized Kpoint/Band indices
1|127 | 47|50|
%
These runs will generate two QP databases that can be merged with the ypp utility.
If you do so, pay attention not to overwrite these databases; e.g., you can add the following lines at the end of your SLURM scripts:
Code: Select all
mv ./SAVE/ndb.QP ./SAVE/ndb.QP_1   # for the first run
mv ./SAVE/ndb.QP ./SAVE/ndb.QP_2   # for the second run
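The protective rename above can be demonstrated on placeholder files (a minimal sketch: here `touch` stands in for yambo writing the database; in the real SLURM script the `mv` goes right after the yambo command):

```shell
# Simulate two consecutive runs writing ./SAVE/ndb.QP and protect
# each result with a rename before the next run overwrites it.
mkdir -p SAVE

touch SAVE/ndb.QP               # stand-in for Run1 writing its QP database
mv SAVE/ndb.QP SAVE/ndb.QP_1    # keep Run1's result

touch SAVE/ndb.QP               # stand-in for Run2 writing its QP database
mv SAVE/ndb.QP SAVE/ndb.QP_2    # keep Run2's result

ls SAVE                         # both databases survive
```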
Here are some other suggestions, not related to the memory issue:
1) CUTGeo= "Z" is not a valid value; replace it with the following (note that, as it stands, the Coulomb cutoff technique is not actually being used in your calculation):
Code: Select all
CUTGeo= "box Z"
2)
Code: Select all
% BndsRnXp
1 | 130 | # [Xp] Polarization function bands
%
% GbndRnge
1 | 130 | # [GW] G[W] bands range
%
You cannot use more than 100 bands, as that is the maximum number available in your database (nscf calculation); if you need more bands, you must compute them with QE (nscf).
3) You can also consider using terminators to accelerate band convergence by adding these two lines to your input:
Code: Select all
GTermKind= "BG" # [GW] GW terminator ("none","BG" Bruneval-Gonze)
GTermEn= 40.81708 eV # [GW] GW terminator energy (only for kind="BG")
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 19
- Joined: Sun Dec 03, 2017 10:24 am
- Location: Algeria
Re: an error message at the GW level
Dear Daniele,
Thank you very much for your advice. I started the calculations following the different instructions you showed me, and they are working well so far.
my best regards
zaabar foudil
university of bejaia
Algeria
-
- Posts: 19
- Joined: Sun Dec 03, 2017 10:24 am
- Location: Algeria
Re: an error message at the GW level
Dear Daniele,
I want to perform the previous calculations (GW, BSE) that I showed you on an EC2 instance (Amazon cloud). I installed yambo-4.1.2 (parallel version), and it works well with the command mpirun -n 8 .../bin/yambo: it produces the LOG folder and the other output files (r_optique). But when I close the terminal in which I typed the command (mpirun -n ...), or get disconnected, the calculation stops...
Is there a script or a solution to overcome that?
my best regards
zaabar foudil
university of bejaia
Algeria
- Daniele Varsano
- Posts: 3816
- Joined: Tue Mar 17, 2009 2:23 pm
- Contact:
Re: an error message at the GW level
Dear Zaabar,
I do not know the Amazon cloud environment at all, but from what you say you can try running the job as:
nohup mpirun -n 8 ... .... /bin/yambo &
For nohup usage, have a look e.g. here:
https://www.cyberciti.biz/tips/nohup-ex ... rompt.html
or just search for "nohup" on Google.
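The pattern can be tried with a harmless command first (a sketch; replace `sleep 2` with the actual `mpirun -n 8 .../bin/yambo` line, using your own full path to the executable):

```shell
# nohup detaches the job from the terminal's hangup (HUP) signal,
# so it keeps running after logout; redirect output to keep the logs.
nohup sleep 2 > job.log 2>&1 &
JOB_PID=$!
echo "background job started with PID $JOB_PID"
wait "$JOB_PID"   # only for this demo; in practice you just log out
```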
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
-
- Posts: 19
- Joined: Sun Dec 03, 2017 10:24 am
- Location: Algeria
Re: an error message at the GW level
Dear Daniele,
I tested the nohup command, and now the problem is solved.
Thank you very much.
my best regards
foudil zaabar
university of bejaia
Algeria