Haydock in BSE - segmentation fault

Post by malwi » Fri Oct 27, 2023 10:00 am

Dear Team,

I get a segmentation fault in a BSE calculation with SOC at the last step, the Haydock solver. The Slurm output is listed below, followed by the end of the LOG file.
It happens with both the "time-profile" and the "no-time-profile" builds.

Best regards,
Malgorzata

==============================
GCCcore/11.3.0 loaded.
zlib/1.2.12 loaded.
binutils/2.38 loaded.
numactl/2.0.14 loaded.
CUDA/11.7.0 loaded.
NVHPC/22.11-CUDA-11.7.0 loaded.
XZ/5.2.5 loaded.
libxml2/2.9.13 loaded.
libpciaccess/0.16 loaded.
hwloc/2.7.1 loaded.
OpenSSL/1.1 loaded.
libevent/2.1.12 loaded.
UCX/1.12.1 loaded.
GDRCopy/2.3 loaded.
UCX-CUDA/1.12.1-CUDA-11.7.0 loaded.
libfabric/1.15.1 loaded.
PMIx/4.1.2 loaded.
UCC/1.0.0 loaded.
NCCL/2.12.12-CUDA-11.7.0 loaded.
UCC-CUDA/1.0.0-CUDA-11.7.0 loaded.
OpenMPI/4.1.4 loaded.
Yambo/5.1.1-991f327-no-time-profile loaded.
[t0024:2940403:0:2940403] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc57dc2d0)
[t0024:2940407:0:2940407] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc54bbcb0)
[t0024:2940401:0:2940401] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc61a92d0)
[t0024:2940402:0:2940402] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc63e1e60)
[t0024:2940408:0:2940408] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc46b8c10)
[t0024:2940404:0:2940404] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc4de0380)
[t0024:2940405:0:2940405] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc47aff00)
[t0024:2940406:0:2940406] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffffc6e419d0)
==== backtrace (tid:2940403) ====
0 0x0000000000054df0 __GI___sigaction() :0
1 0x0000000000628bf4 sym_init_table() /memfs/462823/Yambo/5.1.1/NVHPC-22.11-CUDA-11.7.0-991f327-no-time-profile/yambo/src/parser/PARSER_symbols.c:44
2 0x0000000000626f5b parse_init() /memfs/462823/Yambo/5.1.1/NVHPC-22.11-CUDA-11.7.0-991f327-no-time-profile/yambo/src/parser/PARSER.c:71
3 0x0000000000626c06 iparse_init_() /memfs/462823/Yambo/5.1.1/NVHPC-22.11-CUDA-11.7.0-991f327-no-time-profile/yambo/src/parser/PARSER_interface.c:31
4 0x00000000006184a3 it_tools_it_reset_() /memfs/462823/Yambo/5.1.1/NVHPC-22.11-CUDA-11.7.0-991f327-no-time-profile/yambo/src/parser/mod_it_tools.f90:61
==============================
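
A side note on the faulting addresses in the output above: all eight have their upper 32 bits set to ones. The short Python sketch below reinterprets them as signed integers to make that pattern visible. This is what a negative 32-bit value looks like after sign extension, so it may hint at a negative offset or an overflowed index, but it is only a hint and does not identify the cause. The helper names `as_signed` and `is_sign_extended_32` are made up for this illustration.

```python
# Faulting addresses copied from the Slurm output above.
ADDRESSES = [
    0xffffffffc57dc2d0,
    0xffffffffc54bbcb0,
    0xffffffffc61a92d0,
    0xffffffffc63e1e60,
    0xffffffffc46b8c10,
    0xffffffffc4de0380,
    0xffffffffc47aff00,
    0xffffffffc6e419d0,
]


def as_signed(value: int, bits: int) -> int:
    """Reinterpret an unsigned integer as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        return value - (1 << bits)
    return value


def is_sign_extended_32(addr: int) -> bool:
    """True if the upper 32 bits are all ones and bit 31 is set as well."""
    return (addr >> 32) == 0xFFFFFFFF and ((addr >> 31) & 1) == 1


if __name__ == "__main__":
    for addr in ADDRESSES:
        print(
            f"{addr:#018x}  as int64 = {as_signed(addr, 64):d}  "
            f"sign-extended int32: {is_sign_extended_32(addr)}"
        )
```

Each MPI task reports a different address, all of the same form, which fits a problem that depends on per-task data rather than on one fixed bad pointer.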

End of the LOG file

<35s> P1: Loading full BSE kernel |###### | [015%] 26s(E) 02m-55s(X)
<44s> P1: Loading full BSE kernel |####### | [017%] 35s(E) 03m-23s(X)
<57s> P1: Loading full BSE kernel |######## | [020%] 48s(E) 04m-04s(X)
<01m-21s> P1: Loading full BSE kernel |######### | [022%] 01m-12s(E) 05m-22s(X)
<01m-37s> P1: Loading full BSE kernel |########## | [025%] 01m-28s(E) 05m-53s(X)
<01m-57s> P1: Loading full BSE kernel |########### | [027%] 01m-48s(E) 06m-34s(X)
<02m-28s> P1: Loading full BSE kernel |############ | [030%] 02m-19s(E) 07m-45s(X)
<03m-13s> P1: Loading full BSE kernel |############# | [032%] 03m-04s(E) 09m-28s(X)
<04m-21s> P1: Loading full BSE kernel |############## | [035%] 04m-12s(E) 12m-00s(X)
<05m-28s> P1: Loading full BSE kernel |############### | [037%] 05m-19s(E) 14m-11s(X)
<06m-06s> P1: Loading full BSE kernel |################ | [040%] 05m-57s(E) 14m-53s(X)
<06m-58s> P1: Loading full BSE kernel |################# | [042%] 06m-49s(E) 16m-04s(X)
<08m-27s> P1: Loading full BSE kernel |################## | [045%] 08m-19s(E) 18m-28s(X)
<09m-17s> P1: Loading full BSE kernel |################### | [047%] 09m-08s(E) 19m-15s(X)
<10m-24s> P1: Loading full BSE kernel |#################### | [050%] 10m-16s(E) 20m-31s(X)
<11m-25s> P1: Loading full BSE kernel |##################### | [052%] 11m-16s(E) 21m-29s(X)
<12m-37s> P1: Loading full BSE kernel |###################### | [055%] 12m-29s(E) 22m-41s(X)
<13m-39s> P1: Loading full BSE kernel |####################### | [057%] 13m-30s(E) 23m-29s(X)
<15m-23s> P1: Loading full BSE kernel |######################## | [060%] 15m-14s(E) 25m-23s(X)
<17m-32s> P1: Loading full BSE kernel |######################### | [062%] 17m-23s(E) 27m-50s(X)
<19m-53s> P1: Loading full BSE kernel |########################## | [065%] 19m-44s(E) 30m-22s(X)
<21m-50s> P1: Loading full BSE kernel |########################### | [067%] 21m-42s(E) 32m-09s(X)
<24m-31s> P1: Loading full BSE kernel |############################ | [070%] 24m-22s(E) 34m-48s(X)
<27m-27s> P1: Loading full BSE kernel |############################# | [072%] 27m-18s(E) 37m-40s(X)
<29m-29s> P1: Loading full BSE kernel |############################## | [075%] 29m-20s(E) 39m-07s(X)
<32m-11s> P1: Loading full BSE kernel |############################### | [077%] 32m-02s(E) 41m-20s(X)
<33m-23s> P1: Loading full BSE kernel |################################ | [080%] 33m-14s(E) 41m-33s(X)
<35m-25s> P1: Loading full BSE kernel |################################# | [082%] 35m-16s(E) 42m-45s(X)
<37m-08s> P1: Loading full BSE kernel |################################## | [085%] 36m-59s(E) 43m-31s(X)
<38m-49s> P1: Loading full BSE kernel |################################### | [087%] 38m-41s(E) 44m-12s(X)
<39m-59s> P1: Loading full BSE kernel |#################################### | [090%] 39m-50s(E) 44m-16s(X)
<41m-32s> P1: Loading full BSE kernel |##################################### | [092%] 41m-23s(E) 44m-45s(X)
<42m-12s> P1: Loading full BSE kernel |###################################### | [095%] 42m-03s(E) 44m-16s(X)
<42m-37s> P1: Loading full BSE kernel |####################################### | [097%] 42m-29s(E) 43m-34s(X)
<42m-45s> P1: Loading full BSE kernel |########################################| [100%] 42m-37s(E) 42m-37s(X)
<46m-39s> P1: [05.02] BSE solver(s) @q1
<46m-39s> P1: [05.03] Haydock Solver in the optics basis @q1 using the hermitian scheme
===================================================
dr hab. Małgorzata Wierzbowska, Prof. IHPP PAS
Institute of High Pressure Physics Polish Academy of Sciences
Warsaw, Poland

Post by Daniele Varsano » Mon Oct 30, 2023 11:26 am

Dear Gosia,
can you attach your input and report files? You can use the attachments function below the message and add files after renaming the suffix (e.g. input.txt, report.txt).
Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Post by malwi » Tue Oct 31, 2023 12:31 am

Dear Daniele,
thank you. I have attached the files. The job was run with 8 CPUs and 8 GPUs, 1 thread per CPU.
Gosia

Post by Daniele Varsano » Thu Nov 02, 2023 10:01 am

Dear Gosia,
not easy to spot the problem!

The only thing I can see is that your input file does not include any QP correction (neither from a database nor as a scissor operator).
BSE on top of KS energies can lead to negative excitation energies, and I'm not sure the Haydock solver in the Hermitian scheme can handle that case. To check whether this is what is happening, can you add a QP scissor correction by hand and see if Yambo runs without error?
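
To illustrate the point about negative excitation energies, here is a toy Python sketch, not a Yambo calculation: a 2x2 Hermitian "BSE-like" matrix built from two transition energies plus an attractive kernel. All numbers are hypothetical and chosen only so that the effect shows up; they are not GaN values. A rigid scissor shift is added to the diagonal, which moves every eigenvalue up by the same amount.

```python
import math


def lowest_excitation(e1, e2, w_diag, k_offdiag, scissor=0.0):
    """Lowest eigenvalue of the real symmetric 2x2 matrix
        [[e1 + scissor - w_diag, k_offdiag            ],
         [k_offdiag,             e2 + scissor - w_diag]]
    e1, e2    : independent-particle transition energies (eV)
    w_diag    : attractive diagonal kernel term (eV), toy value
    k_offdiag : coupling between the two transitions (eV), toy value
    scissor   : rigid shift of the conduction bands (eV)
    """
    a = e1 + scissor - w_diag
    d = e2 + scissor - w_diag
    mean = 0.5 * (a + d)
    half_gap = 0.5 * (a - d)
    return mean - math.sqrt(half_gap * half_gap + k_offdiag * k_offdiag)


if __name__ == "__main__":
    # Hypothetical small-gap case where the kernel outweighs the KS gap.
    params = dict(e1=0.5, e2=0.7, w_diag=0.6, k_offdiag=-0.2)
    print("no scissor :", lowest_excitation(**params))
    print("scissor 1eV:", lowest_excitation(**params, scissor=1.0))
```

In this toy case the lowest eigenvalue is negative without the shift and positive with it, which is the kind of change the suggested scissor test is meant to probe.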

Best,
Daniele

Post by malwi » Fri Nov 03, 2023 9:47 pm

Dear Daniele,

This run is GaN (4 atoms in the cell) with SOC. I know where the first peak is, because this is my third run with an increasingly dense k-mesh.
Previous calculations with fewer k-points went well. I got Haydock results for this system with 131 k-points in the IBZ.
Now it fails with 315 k-points in the IBZ. I have "force_symmorphic = .true."

Another run for this system without SOC went well with 627 k-points in the IBZ and failed at Haydock with 1103 k-points.
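
Since the failures appear only above a certain k-point count, a rough size estimate may help when comparing runs. The Python sketch below uses the standard counting for the resonant BSE block, dimension = (k-points in the full BZ) x (valence bands) x (conduction bands), and reports the element count and the memory of a full, undistributed matrix. The band numbers and BZ k-point counts are hypothetical placeholders (the thread gives only IBZ counts), the 16 bytes per element assume double-precision complex, and the real footprint per task depends on precision and on how the kernel is distributed. The comparison with the 32-bit integer limit is included only as a reference point; nothing in this thread establishes that a 32-bit index is involved.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def bse_size(nk_bz, n_val, n_cond, bytes_per_element=16):
    """Rough size of the resonant BSE block.

    nk_bz             : k-points in the full BZ (not the IBZ)
    n_val, n_cond     : valence and conduction bands in the BSE window
    bytes_per_element : 16 for complex double, 8 for complex single
    """
    dimension = nk_bz * n_val * n_cond   # number of e-h transitions
    elements = dimension * dimension     # full matrix, no symmetry used
    return {
        "dimension": dimension,
        "elements": elements,
        "gib": elements * bytes_per_element / 2**30,
        "fits_int32": elements <= INT32_MAX,
    }


if __name__ == "__main__":
    # Hypothetical inputs; replace with the values from the report file.
    for nk_bz in (1000, 4000):
        info = bse_size(nk_bz=nk_bz, n_val=6, n_cond=6)
        print(nk_bz, info)
```

Plugging in the actual BZ k-point counts and band ranges of the working and failing runs would show whether the failures line up with a size threshold.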

I am looking at the parallelization and trying different CPU distributions, still with only 8 CPUs and 8 GPUs in total.
Maciej Czuchry suggested using "ulimit -s unlimited", but it did not help.

If you have any other ideas, I would be grateful. Thanks :-)
Gosia