Large memory and disk usage.

Having trouble compiling the Yambo source? Using an unusual architecture? Problems with the "configure" script? Problems in GPU architectures? This is the place to look.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano, Conor Hogan, Nicola Spallanzani

Forum rules
If you have trouble compiling Yambo, please make sure to list:
(1) the compiler (vendor and release: e.g. intel 10.1)
(2) the architecture (e.g. 64-bit IBM SP5)
(3) if the problems occur compiling in serial/in parallel
(4) the version of Yambo (revision number/major release version)
(5) the relevant compiler error message
chinaye
Posts: 16
Joined: Wed Mar 25, 2009 9:35 am

Large memory and disk usage.

Post by chinaye » Sun Nov 01, 2009 7:48 am

Dear developers,
we are now calculating the absorption spectra of relatively large systems. The SAVE folder may exceed 200 GB or even 800 GB, and the memory used per process is 7-8 GB. We find that the calculation stalls while forming the BSK (for instance, it stops when the BSK is 45% completed, although the CPU is still busy). If we restart the job, the same problem appears again. We would like to know whether yambo is currently capable of handling very large memory and disk usage. Can this problem be solved by compiling yambo with more suitable environment variables? We also find that the ypp executable does not appear, although no errors are reported by the make process. Should we revise the Makefile? Thank you very much.
Jianfei Ye, PHD
Shanghai Institute of Ceramic Chinese Academy of Science (SICCAS)
Email:jianfeiye@mail.sic.ac.cn

myrta gruning
Posts: 240
Joined: Tue Mar 17, 2009 11:38 am

Re: Large memory and disk usage.

Post by myrta gruning » Sun Nov 01, 2009 1:13 pm

chinaye wrote: we are now calculating the absorption spectra of relatively large systems. The SAVE folder may exceed 200 GB or even 800 GB, and the memory used per process is 7-8 GB. We find that the calculation stalls while forming the BSK (for instance, it stops when the BSK is 45% completed, although the CPU is still busy). If we restart the job, the same problem appears again. We would like to know whether yambo is currently capable of handling very large memory and disk usage. Can this problem be solved by compiling yambo with more suitable environment variables?
Dear Jianfei Ye,

Does any error message appear in the report, the log, or standard error/output? [e.g. a timeout because of slow or missing communication between nodes]
Are you sure it is not a problem of the job exceeding the available memory?
Also, sometimes (depending on the machine) the log is updated only from time to time. Are you sure the job is really doing nothing (you say the CPUs are still busy; is the memory also still occupied by the processes?), or is it simply that the log has not been written for a long time?

The only restriction I know of for large disk usage is with NetCDF. Are you using NetCDF? By default, NetCDF cannot work with files larger than 2 GB. Do you have databases larger than that? If this is the case, you may need to reconfigure yambo enabling large databases (see ./configure --help) and recompile.
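
To check quickly whether any database exceeds that size, something like the following Python sketch may help. It is only an illustration: the default folder name "SAVE" is taken from this thread, the function name is made up for the example, and the threshold is assumed to be 2 GiB.

```python
import os
import sys

TWO_GIB = 2 * 1024**3  # assumed reading of the "2 GB" limit quoted above


def files_over_limit(root, limit=TWO_GIB):
    """Return (path, size_in_bytes) for each regular file under root larger than limit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                size = os.path.getsize(path)
                if size > limit:
                    hits.append((path, size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical usage: python check_sizes.py SAVE
    root = sys.argv[1] if len(sys.argv) > 1 else "SAVE"
    for path, size in files_over_limit(root):
        print(f"{size / 1024**3:8.2f} GiB  {path}")
```

If the script prints nothing, no file under the given folder is above the threshold.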

Please try to give as many details as may be needed to understand the problem.

Now coming to your second point:
chinaye wrote: We find that the ypp executable does not appear, although no errors are reported by the make process. Should we revise the Makefile?
In principle, the Makefile generated by the configure command should be fine, with no need to revise it.

Are you using
make all
or
make ypp?

Can you post the output of the
"make all" or
"make ypp" command
together with all the details (see forum rules)?
(To be sure, work from scratch by typing "make clean" first; note that this will also remove all executables in the bin directory.)

Regards,
Myrta
Dr Myrta Grüning
School of Mathematics and Physics
Queen's University Belfast - Northern Ireland

http://www.researcherid.com/rid/B-1515-2009

chinaye
Posts: 16
Joined: Wed Mar 25, 2009 9:35 am

Re: Large memory and disk usage.

Post by chinaye » Mon Nov 02, 2009 2:43 am

Dear Myrta,
thank you for your attention.
1) Are you sure it is not a problem of the job exceeding the available memory?
Also, sometimes (depending on the machine) the log is updated only from time to time. Are you sure the job is really doing nothing (you say the CPUs are still busy; is the memory also still occupied by the processes?), or is it simply that the log has not been written for a long time?
There is no error report, since the job has not terminated. Here are two files, CPU and l_optics_bse_em1s_bss. The former indicates that the CPU is busy and that the memory is not exceeded. The job has run for about 72 hours. l_optics_bse_em1s_bss has not been updated for at least 16 hours (our estimate). The BSK is 90% completed.
We find that forming 10% of the BSK takes nearly 3 hours.
CPU:
top - 09:56:58 up 11 days, 19:38, 1 user, load average: 15.00, 15.00, 15.00
Tasks: 217 total, 17 running, 200 sleeping, 0 stopped, 0 zombie
Cpu(s): 93.7%us, 0.0%sy, 0.0%ni, 6.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32962108k total, 31176764k used, 1785344k free, 219984k buffers
Swap: 4200988k total, 2634256k used, 1566732k free, 6310732k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23317 jfye 25 0 2309m 1.5g 4564 R 100 4.9 4345:33 yambo
23307 jfye 25 0 2309m 1.5g 4404 R 100 4.8 4345:00 yambo
23310 jfye 25 0 2309m 1.7g 4524 R 100 5.4 4343:16 yambo
23311 jfye 25 0 2309m 1.3g 4360 R 100 4.3 4343:33 yambo
23313 jfye 25 0 2309m 1.7g 4376 R 100 5.3 4344:41 yambo
23314 jfye 25 0 2309m 1.7g 4540 R 100 5.4 4343:23 yambo
23316 jfye 25 0 2309m 1.7g 4452 R 100 5.3 4344:49 yambo
23319 jfye 25 0 2309m 1.3g 4548 R 100 4.1 4343:47 yambo
23320 jfye 25 0 2309m 1.7g 4576 R 100 5.4 4343:22 yambo
23306 jfye 25 0 2309m 1.7g 4280 R 100 5.3 4344:51 yambo
23308 jfye 25 0 2309m 1.5g 4532 R 100 4.9 4345:28 yambo
23309 jfye 25 0 2309m 1.3g 4336 R 100 4.2 4342:30 yambo
23312 jfye 25 0 2309m 1.5g 4508 R 100 4.6 4345:29 yambo
23315 jfye 25 0 2309m 1.7g 4524 R 100 5.3 4344:40 yambo
23318 jfye 25 0 2309m 1.3g 4332 R 100 4.2 4343:48 yambo
7498 jfye 16 0 5652 1288 868 R 0 0.0 0:00.04 top

l_optics_bse_em1s_bss:
......
<01d-06h-35m-15s> P02: BSK |################ | [080%] 01d-05h-28m-34s(E) 01d-12h-50m-43s(X)
<01d-06h-35m-05s> P09: BSK |################ | [080%] 01d-05h-28m-24s(E) 01d-12h-50m-30s(X)
<01d-06h-35m-07s> P11: BSK |################ | [080%] 01d-05h-28m-26s(E) 01d-12h-50m-33s(X)
<01d-06h-35m-13s> P12: BSK |################ | [080%] 01d-05h-28m-32s(E) 01d-12h-50m-40s(X)
<01d-06h-34m-59s> P07: BSK |################ | [080%] 01d-05h-28m-21s(E) 01d-12h-50m-26s(X)
<01d-06h-35m-10s> P14: BSK |################ | [080%] 01d-05h-28m-29s(E) 01d-12h-50m-37s(X)
<01d-06h-34m-18s> P16: BSK |################ | [080%] 01d-05h-27m-37s(E) 01d-12h-49m-32s(X)
<01d-06h-34m-20s> P10: BSK |################ | [080%] 01d-05h-27m-39s(E) 01d-12h-49m-33s(X)
<01d-06h-35m-13s> P15: BSK |################ | [080%] 01d-05h-28m-32s(E) 01d-12h-50m-40s(X)
<01d-06h-36m-28s> P08: BSK |################ | [080%] 01d-05h-29m-47s(E) 01d-12h-52m-14s(X)
<01d-06h-36m-29s> P04: BSK |################ | [080%] 01d-05h-29m-48s(E) 01d-12h-52m-15s(X)
<01d-06h-33m-52s> P01: BSK |################ | [080%] 01d-05h-27m-11s(E) 01d-12h-48m-59s(X)
<01d-06h-36m-09s> P03: BSK |################ | [080%] 01d-05h-29m-28s(E) 01d-12h-51m-50s(X)
<01d-06h-33m-59s> P05: BSK |################ | [080%] 01d-05h-27m-22s(E) 01d-12h-49m-13s(X)
<01d-06h-36m-31s> P13: BSK |################ | [080%] 01d-05h-29m-50s(E) 01d-12h-52m-17s(X)
<01d-06h-34m-13s> P06: BSK |################ | [080%] 01d-05h-27m-32s(E) 01d-12h-49m-25s(X)
<01d-09h-19m-58s> P02: BSK |################# | [085%] 01d-08h-13m-17s(E) 01d-13h-54m-27s(X)
<01d-09h-19m-47s> P09: BSK |################# | [085%] 01d-08h-13m-06s(E) 01d-13h-54m-14s(X)
<01d-09h-19m-56s> P12: BSK |################# | [085%] 01d-08h-13m-15s(E) 01d-13h-54m-24s(X)
<01d-09h-19m-50s> P11: BSK |################# | [085%] 01d-08h-13m-09s(E) 01d-13h-54m-18s(X)
<01d-09h-19m-37s> P14: BSK |################# | [085%] 01d-08h-12m-56s(E) 01d-13h-54m-02s(X)
<01d-09h-19m-21s> P07: BSK |################# | [085%] 01d-08h-12m-42s(E) 01d-13h-53m-46s(X)
<01d-09h-18m-29s> P05: BSK |################# | [085%] 01d-08h-11m-53s(E) 01d-13h-52m-48s(X)
<01d-09h-18m-09s> P01: BSK |################# | [085%] 01d-08h-11m-28s(E) 01d-13h-52m-19s(X)
<01d-09h-19m-38s> P15: BSK |################# | [085%] 01d-08h-12m-57s(E) 01d-13h-54m-04s(X)
<01d-09h-18m-50s> P16: BSK |################# | [085%] 01d-08h-12m-09s(E) 01d-13h-53m-07s(X)
<01d-09h-18m-52s> P10: BSK |################# | [085%] 01d-08h-12m-11s(E) 01d-13h-53m-09s(X)
<01d-09h-21m-01s> P04: BSK |################# | [085%] 01d-08h-14m-20s(E) 01d-13h-55m-41s(X)
<01d-09h-21m-00s> P08: BSK |################# | [085%] 01d-08h-14m-19s(E) 01d-13h-55m-40s(X)
<01d-09h-21m-02s> P13: BSK |################# | [085%] 01d-08h-14m-21s(E) 01d-13h-55m-43s(X)
<01d-09h-20m-32s> P03: BSK |################# | [085%] 01d-08h-13m-51s(E) 01d-13h-55m-08s(X)
<01d-09h-18m-46s> P06: BSK |################# | [085%] 01d-08h-12m-05s(E) 01d-13h-53m-02s(X)
<01d-12h-12m-29s> P09: BSK |################## | [090%] 01d-11h-05m-48s(E) 01d-14h-59m-47s(X)
<01d-12h-12m-40s> P02: BSK |################## | [090%] 01d-11h-06m-00s(E) 01d-15h-00m-00s(X)
<01d-12h-12m-32s> P11: BSK |################## | [090%] 01d-11h-05m-51s(E) 01d-14h-59m-50s(X)
<01d-12h-12m-38s> P12: BSK |################## | [090%] 01d-11h-05m-57s(E) 01d-14h-59m-57s(X)
<01d-12h-10m-51s> P05: BSK |################## | [090%] 01d-11h-04m-14s(E) 01d-14h-58m-03s(X)
<01d-12h-12m-07s> P15: BSK |################## | [090%] 01d-11h-05m-26s(E) 01d-14h-59m-22s(X)
<01d-12h-11m-25s> P10: BSK |################## | [090%] 01d-11h-04m-44s(E) 01d-14h-58m-36s(X)
<01d-12h-11m-23s> P16: BSK |################## | [090%] 01d-11h-04m-42s(E) 01d-14h-58m-34s(X)
<01d-12h-10m-32s> P01: BSK |################## | [090%] 01d-11h-03m-51s(E) 01d-14h-57m-37s(X)
<01d-12h-13m-33s> P08: BSK |################## | [090%] 01d-11h-06m-52s(E) 01d-15h-00m-58s(X)
<01d-12h-13m-04s> P03: BSK |################## | [090%] 01d-11h-06m-23s(E) 01d-15h-00m-25s(X)
<01d-12h-13m-36s> P13: BSK |################## | [090%] 01d-11h-06m-55s(E) 01d-15h-01m-02s(X)
<01d-12h-11m-46s> P07: BSK |################## | [090%] 01d-11h-05m-08s(E) 01d-14h-59m-02s(X)
<01d-12h-13m-35s> P04: BSK |################## | [090%] 01d-11h-06m-54s(E) 01d-15h-01m-00s(X)
<01d-12h-12m-02s> P14: BSK |################## | [090%] 01d-11h-05m-21s(E) 01d-14h-59m-16s(X)
<01d-12h-11m-19s> P06: BSK |################## | [090%] 01d-11h-04m-38s(E) 01d-14h-58m-29s(X)
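
The time spent on each progress step can be read off the log lines above with a short Python sketch like the one below. It assumes that the value in angle brackets is the elapsed time since the start of the run and that the lines follow exactly the layout shown here; the function names are made up for this example.

```python
import re
from collections import defaultdict

# Matches e.g. "<01d-06h-35m-15s> P02: BSK |#### | [080%] ..."
LINE_RE = re.compile(
    r"<(?:(\d+)d-)?(?:(\d+)h-)?(?:(\d+)m-)?(\d+)s>\s*(P\d+):\s*BSK\b.*?\[(\d+)%\]"
)


def to_seconds(days, hours, minutes, seconds):
    """Convert the d/h/m/s fields (missing fields count as 0) to seconds."""
    total_hours = int(days or 0) * 24 + int(hours or 0)
    return (total_hours * 60 + int(minutes or 0)) * 60 + int(seconds)


def step_durations(log_text):
    """Map process label -> list of (from_pct, to_pct, seconds) between reports."""
    reports = defaultdict(list)
    for match in LINE_RE.finditer(log_text):
        d, h, m, s, proc, pct = match.groups()
        reports[proc].append((int(pct), to_seconds(d, h, m, s)))
    steps = {}
    for proc, items in reports.items():
        items.sort()  # order by percentage, then by time
        steps[proc] = [
            (p0, p1, t1 - t0) for (p0, t0), (p1, t1) in zip(items, items[1:])
        ]
    return steps


if __name__ == "__main__":
    sample = """\
<01d-06h-35m-15s> P02: BSK |################ | [080%] 01d-05h-28m-34s(E) 01d-12h-50m-43s(X)
<01d-09h-19m-58s> P02: BSK |################# | [085%] 01d-08h-13m-17s(E) 01d-13h-54m-27s(X)
<01d-12h-12m-40s> P02: BSK |################## | [090%] 01d-11h-06m-00s(E) 01d-15h-00m-00s(X)
"""
    for proc, steps in step_durations(sample).items():
        for p0, p1, secs in steps:
            print(f"{proc}: {p0}% -> {p1}% took {secs} s")
```

For the three P02 lines used as the sample, the steps from 80% to 85% and from 85% to 90% come out at 9883 s and 10362 s, that is, roughly 2 h 45 min and 2 h 53 min per 5% step in this excerpt.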

2) The only restriction I know of for large disk usage is with NetCDF. Are you using NetCDF? By default, NetCDF cannot work with files larger than 2 GB. Do you have databases larger than that? If this is the case, you may need to reconfigure yambo enabling large databases (see ./configure --help) and recompile.
We did not use NetCDF.
3) Are you using make all or make ypp?
The problem is solved. Previously we had only typed "make yambo interfaces".
Jianfei Ye, PHD
Shanghai Institute of Ceramic Chinese Academy of Science (SICCAS)
Email:jianfeiye@mail.sic.ac.cn

andrea marini
Posts: 325
Joined: Mon Mar 16, 2009 4:27 pm

Re: Large memory and disk usage.

Post by andrea marini » Mon Nov 02, 2009 9:45 am

Please attach to your reply all text input/output files (o* r* l* *err* *out*) involved in the calculation. We need to know the size of your system before we can be of any help. In any case, the best we can do is check whether you have reached the known limits of the code and/or whether there is some evident I/O problem. If the code did not stop, there was no formal error, so my very first guess is that there was some MPI communication problem.
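
One way to gather those files into a single attachment is sketched below in Python. The glob patterns are the ones listed above; the archive name and function names are only placeholders for this example. The patterns are broad (for instance, r* matches any file whose name starts with "r"), so it is worth reviewing the list before posting.

```python
import glob
import os
import tarfile

# Patterns quoted in the request above.
PATTERNS = ("o*", "r*", "l*", "*err*", "*out*")


def collect_report_files(run_dir="."):
    """Return the sorted, de-duplicated regular files in run_dir matching PATTERNS."""
    found = set()
    for pattern in PATTERNS:
        for path in glob.glob(os.path.join(run_dir, pattern)):
            if os.path.isfile(path):  # skip directories such as output folders
                found.add(path)
    return sorted(found)


def make_archive(run_dir=".", archive_name="yambo_report_files.tar.gz"):
    """Pack the matching files into a gzip-compressed tar archive; return the file list."""
    files = collect_report_files(run_dir)
    with tarfile.open(archive_name, "w:gz") as tar:
        for path in files:
            # Store files under their bare names, without the directory prefix.
            tar.add(path, arcname=os.path.basename(path))
    return files


if __name__ == "__main__":
    for path in make_archive():
        print("added", path)
```

Run from the directory of the calculation, this writes yambo_report_files.tar.gz there and prints the files it packed.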

Then a very general question: are you sure you did not push the variable values too far? I mean, can you reduce the size of the calculation? Or does the whole problem reside in the number of G-vectors (that is the only place where you actually see the size of your system)?

Andrea
Andrea MARINI
Istituto di Struttura della Materia, CNR, (Italy)

chinaye
Posts: 16
Joined: Wed Mar 25, 2009 9:35 am

Re: Large memory and disk usage.

Post by chinaye » Sun Nov 08, 2009 7:05 am

Dear Andrea Marini,
we need some time to test various possibilities. At present, errors in the I/O process are very likely the cause of this problem.
Jianfei Ye, PHD
Shanghai Institute of Ceramic Chinese Academy of Science (SICCAS)
Email:jianfeiye@mail.sic.ac.cn
