Large memory and disk usage.

Posted: Sun Nov 01, 2009 7:48 am
by chinaye
Dear developers,
We are now calculating the absorption spectra of relatively large systems. The SAVE folder may exceed 200 GB or even 800 GB, and the memory used per process is 7-8 GB. The calculation stalls while the BSK is being formed (for instance, it stops advancing when the BSK is 45% complete, although the CPU is still busy). If we restart the job, the same problem appears again. We would like to know whether yambo can currently handle such large memory and disk usage. Could this problem be solved by compiling yambo with more suitable environment variables? We also find that the ypp executable is not built, although no errors are reported during the make process. Should we revise the Makefile? Thank you very much.

Re: Large memory and disk usage.

Posted: Sun Nov 01, 2009 1:13 pm
by myrta gruning
chinaye wrote: We are now calculating the absorption spectra of relatively large systems. The SAVE folder may exceed 200 GB or even 800 GB, and the memory used per process is 7-8 GB. The calculation stalls while the BSK is being formed (for instance, it stops advancing when the BSK is 45% complete, although the CPU is still busy). If we restart the job, the same problem appears again. We would like to know whether yambo can currently handle such large memory and disk usage. Could this problem be solved by compiling yambo with more suitable environment variables?
Dear Jianfei Ye,

Does any error message appear in the report, the log, or standard error/output (e.g. a timeout caused by slow or missing communication between nodes)?
Are you sure the job is not exceeding the available memory?
Also, on some machines the log is only updated from time to time. Are you sure the job is really doing nothing (you say the CPUs are still busy; is the memory still occupied by the processes as well?), or is it simply that the log has not been written for a long time?

The only restriction I know of for large disk usage concerns netCDF. Are you using netCDF? By default, netCDF cannot work with files larger than 2 GB. Do you have databases larger than that? If so, you may need to reconfigure yambo with large-database support enabled (see ./configure --help) and recompile.
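
To answer the "databases larger than that" question quickly, one option is to scan the job directory for oversized files. The following is only a minimal sketch: the function name files_over_limit is made up for illustration, the limit is assumed to mean 2 GiB, and the "SAVE" path in the usage note is simply the folder name mentioned in the first post.

```python
import os

# Assumption: the "2 Gb" limit quoted above is taken to mean 2 GiB.
TWO_GIB = 2 * 1024**3


def files_over_limit(root, limit=TWO_GIB):
    """Return (path, size_in_bytes) for every file under root larger than limit.

    The result is sorted with the largest file first.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if size > limit:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling files_over_limit("SAVE") from the run directory would list any candidate files; an empty list suggests the 2 GB limit is not the issue.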

Please try to give as many details as may be needed to understand the problem.

Now coming to your second point:
chinaye wrote: We also find that the ypp executable is not built, although no errors are reported during the make process. Should we revise the Makefile?
In principle, the Makefile produced by the configure command should be fine and should not need revising.

Are you using "make all" or "make ypp"?

Can you post the output of the "make all" or "make ypp" command, together with all the details (see the forum rules)?
(To be sure, work from scratch by typing "make clean" first; note that this will also remove all executables in the bin directory.)

Regards,
Myrta

Re: Large memory and disk usage.

Posted: Mon Nov 02, 2009 2:43 am
by chinaye
Dear Myrta,
thank you for your attention.
1) Are you sure the job is not exceeding the available memory? Also, on some machines the log is only updated from time to time. Are you sure the job is really doing nothing, or is it simply that the log has not been written for a long time?
There is no error report, since the job has not terminated. Below are two files, CPU and l_optics_bse_em1s_bss. The former shows that the CPUs are busy and that the memory is not exhausted. The job has been running for about 72 hours. By our estimate, l_optics_bse_em1s_bss has not been updated for at least 16 hours. The BSK is 90% complete.
We find that forming 10% of the BSK takes nearly 3 hours.
CPU:
top - 09:56:58 up 11 days, 19:38, 1 user, load average: 15.00, 15.00, 15.00
Tasks: 217 total, 17 running, 200 sleeping, 0 stopped, 0 zombie
Cpu(s): 93.7%us, 0.0%sy, 0.0%ni, 6.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32962108k total, 31176764k used, 1785344k free, 219984k buffers
Swap: 4200988k total, 2634256k used, 1566732k free, 6310732k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23317 jfye 25 0 2309m 1.5g 4564 R 100 4.9 4345:33 yambo
23307 jfye 25 0 2309m 1.5g 4404 R 100 4.8 4345:00 yambo
23310 jfye 25 0 2309m 1.7g 4524 R 100 5.4 4343:16 yambo
23311 jfye 25 0 2309m 1.3g 4360 R 100 4.3 4343:33 yambo
23313 jfye 25 0 2309m 1.7g 4376 R 100 5.3 4344:41 yambo
23314 jfye 25 0 2309m 1.7g 4540 R 100 5.4 4343:23 yambo
23316 jfye 25 0 2309m 1.7g 4452 R 100 5.3 4344:49 yambo
23319 jfye 25 0 2309m 1.3g 4548 R 100 4.1 4343:47 yambo
23320 jfye 25 0 2309m 1.7g 4576 R 100 5.4 4343:22 yambo
23306 jfye 25 0 2309m 1.7g 4280 R 100 5.3 4344:51 yambo
23308 jfye 25 0 2309m 1.5g 4532 R 100 4.9 4345:28 yambo
23309 jfye 25 0 2309m 1.3g 4336 R 100 4.2 4342:30 yambo
23312 jfye 25 0 2309m 1.5g 4508 R 100 4.6 4345:29 yambo
23315 jfye 25 0 2309m 1.7g 4524 R 100 5.3 4344:40 yambo
23318 jfye 25 0 2309m 1.3g 4332 R 100 4.2 4343:48 yambo
7498 jfye 16 0 5652 1288 868 R 0 0.0 0:00.04 top
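
The memory question can be checked by adding up the RES column of the listing above. The sketch below assumes the default top column order shown in the header line and assumes that bare numbers are KiB; the helper names res_to_mib and total_res_mib are made up for illustration.

```python
# Conversion factors from top's size suffixes to MiB.
UNIT_TO_MIB = {"k": 1 / 1024, "m": 1.0, "g": 1024.0}


def res_to_mib(field):
    """Convert a top memory field such as '1.5g', '2309m' or '1288' to MiB."""
    suffix = field[-1].lower()
    if suffix in UNIT_TO_MIB:
        return float(field[:-1]) * UNIT_TO_MIB[suffix]
    # Assumption: a bare number is in KiB, as in top's default display.
    return float(field) / 1024


def total_res_mib(top_lines, command="yambo"):
    """Return (process_count, summed_RES_in_MiB) for rows running command."""
    count = 0
    total = 0.0
    for line in top_lines:
        cols = line.split()
        # Expected columns:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(cols) >= 12 and cols[0].isdigit() and cols[-1] == command:
            total += res_to_mib(cols[5])
            count += 1
    return count, total
```

For the 15 yambo rows listed above, the RES values add up to about 23.1 GiB, against roughly 31 GiB of physical memory, while about 2.5 GiB of swap is also in use. Whether that swap usage comes from yambo cannot be told from this listing alone.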

l_optics_bse_em1s_bss:
......
<01d-06h-35m-15s> P02: BSK |################ | [080%] 01d-05h-28m-34s(E) 01d-12h-50m-43s(X)
<01d-06h-35m-05s> P09: BSK |################ | [080%] 01d-05h-28m-24s(E) 01d-12h-50m-30s(X)
<01d-06h-35m-07s> P11: BSK |################ | [080%] 01d-05h-28m-26s(E) 01d-12h-50m-33s(X)
<01d-06h-35m-13s> P12: BSK |################ | [080%] 01d-05h-28m-32s(E) 01d-12h-50m-40s(X)
<01d-06h-34m-59s> P07: BSK |################ | [080%] 01d-05h-28m-21s(E) 01d-12h-50m-26s(X)
<01d-06h-35m-10s> P14: BSK |################ | [080%] 01d-05h-28m-29s(E) 01d-12h-50m-37s(X)
<01d-06h-34m-18s> P16: BSK |################ | [080%] 01d-05h-27m-37s(E) 01d-12h-49m-32s(X)
<01d-06h-34m-20s> P10: BSK |################ | [080%] 01d-05h-27m-39s(E) 01d-12h-49m-33s(X)
<01d-06h-35m-13s> P15: BSK |################ | [080%] 01d-05h-28m-32s(E) 01d-12h-50m-40s(X)
<01d-06h-36m-28s> P08: BSK |################ | [080%] 01d-05h-29m-47s(E) 01d-12h-52m-14s(X)
<01d-06h-36m-29s> P04: BSK |################ | [080%] 01d-05h-29m-48s(E) 01d-12h-52m-15s(X)
<01d-06h-33m-52s> P01: BSK |################ | [080%] 01d-05h-27m-11s(E) 01d-12h-48m-59s(X)
<01d-06h-36m-09s> P03: BSK |################ | [080%] 01d-05h-29m-28s(E) 01d-12h-51m-50s(X)
<01d-06h-33m-59s> P05: BSK |################ | [080%] 01d-05h-27m-22s(E) 01d-12h-49m-13s(X)
<01d-06h-36m-31s> P13: BSK |################ | [080%] 01d-05h-29m-50s(E) 01d-12h-52m-17s(X)
<01d-06h-34m-13s> P06: BSK |################ | [080%] 01d-05h-27m-32s(E) 01d-12h-49m-25s(X)
<01d-09h-19m-58s> P02: BSK |################# | [085%] 01d-08h-13m-17s(E) 01d-13h-54m-27s(X)
<01d-09h-19m-47s> P09: BSK |################# | [085%] 01d-08h-13m-06s(E) 01d-13h-54m-14s(X)
<01d-09h-19m-56s> P12: BSK |################# | [085%] 01d-08h-13m-15s(E) 01d-13h-54m-24s(X)
<01d-09h-19m-50s> P11: BSK |################# | [085%] 01d-08h-13m-09s(E) 01d-13h-54m-18s(X)
<01d-09h-19m-37s> P14: BSK |################# | [085%] 01d-08h-12m-56s(E) 01d-13h-54m-02s(X)
<01d-09h-19m-21s> P07: BSK |################# | [085%] 01d-08h-12m-42s(E) 01d-13h-53m-46s(X)
<01d-09h-18m-29s> P05: BSK |################# | [085%] 01d-08h-11m-53s(E) 01d-13h-52m-48s(X)
<01d-09h-18m-09s> P01: BSK |################# | [085%] 01d-08h-11m-28s(E) 01d-13h-52m-19s(X)
<01d-09h-19m-38s> P15: BSK |################# | [085%] 01d-08h-12m-57s(E) 01d-13h-54m-04s(X)
<01d-09h-18m-50s> P16: BSK |################# | [085%] 01d-08h-12m-09s(E) 01d-13h-53m-07s(X)
<01d-09h-18m-52s> P10: BSK |################# | [085%] 01d-08h-12m-11s(E) 01d-13h-53m-09s(X)
<01d-09h-21m-01s> P04: BSK |################# | [085%] 01d-08h-14m-20s(E) 01d-13h-55m-41s(X)
<01d-09h-21m-00s> P08: BSK |################# | [085%] 01d-08h-14m-19s(E) 01d-13h-55m-40s(X)
<01d-09h-21m-02s> P13: BSK |################# | [085%] 01d-08h-14m-21s(E) 01d-13h-55m-43s(X)
<01d-09h-20m-32s> P03: BSK |################# | [085%] 01d-08h-13m-51s(E) 01d-13h-55m-08s(X)
<01d-09h-18m-46s> P06: BSK |################# | [085%] 01d-08h-12m-05s(E) 01d-13h-53m-02s(X)
<01d-12h-12m-29s> P09: BSK |################## | [090%] 01d-11h-05m-48s(E) 01d-14h-59m-47s(X)
<01d-12h-12m-40s> P02: BSK |################## | [090%] 01d-11h-06m-00s(E) 01d-15h-00m-00s(X)
<01d-12h-12m-32s> P11: BSK |################## | [090%] 01d-11h-05m-51s(E) 01d-14h-59m-50s(X)
<01d-12h-12m-38s> P12: BSK |################## | [090%] 01d-11h-05m-57s(E) 01d-14h-59m-57s(X)
<01d-12h-10m-51s> P05: BSK |################## | [090%] 01d-11h-04m-14s(E) 01d-14h-58m-03s(X)
<01d-12h-12m-07s> P15: BSK |################## | [090%] 01d-11h-05m-26s(E) 01d-14h-59m-22s(X)
<01d-12h-11m-25s> P10: BSK |################## | [090%] 01d-11h-04m-44s(E) 01d-14h-58m-36s(X)
<01d-12h-11m-23s> P16: BSK |################## | [090%] 01d-11h-04m-42s(E) 01d-14h-58m-34s(X)
<01d-12h-10m-32s> P01: BSK |################## | [090%] 01d-11h-03m-51s(E) 01d-14h-57m-37s(X)
<01d-12h-13m-33s> P08: BSK |################## | [090%] 01d-11h-06m-52s(E) 01d-15h-00m-58s(X)
<01d-12h-13m-04s> P03: BSK |################## | [090%] 01d-11h-06m-23s(E) 01d-15h-00m-25s(X)
<01d-12h-13m-36s> P13: BSK |################## | [090%] 01d-11h-06m-55s(E) 01d-15h-01m-02s(X)
<01d-12h-11m-46s> P07: BSK |################## | [090%] 01d-11h-05m-08s(E) 01d-14h-59m-02s(X)
<01d-12h-13m-35s> P04: BSK |################## | [090%] 01d-11h-06m-54s(E) 01d-15h-01m-00s(X)
<01d-12h-12m-02s> P14: BSK |################## | [090%] 01d-11h-05m-21s(E) 01d-14h-59m-16s(X)
<01d-12h-11m-19s> P06: BSK |################## | [090%] 01d-11h-04m-38s(E) 01d-14h-58m-29s(X)
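
To judge whether the job has stalled or the log is merely updated rarely, the time between successive progress reports can be extracted from the log lines above. This is a rough sketch that assumes the line layout shown in the excerpt; the regular expression and the function names are illustrative and not part of yambo.

```python
import re
from collections import defaultdict

# Matches lines such as:
# <01d-06h-35m-15s> P02: BSK |######## | [080%] 01d-05h-28m-34s(E) ...
LINE_RE = re.compile(
    r"<(?P<stamp>[0-9dhms-]+)>\s+(?P<proc>P\d+):\s+BSK\s+"
    r"\|[^|]*\|\s+\[(?P<pct>\d+)%\]"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def stamp_to_seconds(stamp):
    """Convert a stamp such as '01d-06h-35m-15s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", stamp)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def progress_by_process(lines):
    """Map each process label to its list of (percent, seconds) reports."""
    progress = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            point = (int(match.group("pct")),
                     stamp_to_seconds(match.group("stamp")))
            progress[match.group("proc")].append(point)
    return progress


def seconds_per_percent(points):
    """Average seconds per percent between the first and last report."""
    (pct_first, t_first), (pct_last, t_last) = points[0], points[-1]
    if pct_last == pct_first:
        return None  # No progress between the two reports.
    return (t_last - t_first) / (pct_last - pct_first)
```

For P02 in the excerpt, the steps from 80% to 85% and from 85% to 90% each took a little under three hours. A log that has been silent for 16 hours has therefore missed several expected updates, which points towards a stall rather than a merely infrequent log, although the excerpt alone cannot confirm this.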

2) The only restriction I know of for large disk usage concerns netCDF. Are you using netCDF? By default, netCDF cannot work with files larger than 2 GB. Do you have databases larger than that? If so, you may need to reconfigure yambo with large-database support enabled (see ./configure --help) and recompile.
We did not use netCDF.
3) Are you using "make all" or "make ypp"?
The problem is solved. Previously we had only typed "make yambo interfaces".

Re: Large memory and disk usage.

Posted: Mon Nov 02, 2009 9:45 am
by andrea marini
Please attach to your reply all the text input/output files (o* r* l* *err* *out*) involved in the calculation. We need to know the size of your system before we can be of any help. In any case, the best we can do is check whether you have reached the known limits of the code and/or whether there is some evident I/O problem. If the code did not stop, there was no formal error, so my first guess is that there was some MPI communication problem.

Then a very general question: are you sure you did not push the variable values too far? I mean, can you reduce the size of the calculation? Or does the whole problem reside in the number of G-vectors (the only place where you actually see the size of your system)?

Andrea

Re: Large memory and disk usage.

Posted: Sun Nov 08, 2009 7:05 am
by chinaye
Dear Andrea Marini,
We need some time to test various possibilities. At present, errors in the I/O process seem a very likely cause of this problem.