Issue733

Title: reduce translator memory usage
Priority: feature
Status: resolved
Superseder: (none)
Nosy List: florian, jendrik, malte
Assigned To: jendrik
Keywords: translator
Optional summary: (none)

Created on 2017-08-25.12:08:26 by jendrik, last changed by jendrik.

Files
File name Uploaded Type
issue733.patch malte, 2017-09-01.02:42:58 text/x-patch
show_memory_profile.py malte, 2017-09-01.02:49:28 text/x-python
translate.py.with.snapshots malte, 2017-09-01.02:47:09 application/octet-stream
translator-memory-usage.txt jendrik, 2017-08-25.12:14:37 text/plain
translator-memory.py jendrik, 2017-08-25.12:08:26 text/x-python
Messages
msg6525 (view) Author: jendrik Date: 2017-09-04.16:18:12
Merged and e-mail sent.
msg6511 (view) Author: malte Date: 2017-09-03.15:53:52
Excellent! Can you merge this and let the others know that we can now translate
all tasks with normal memory requirements? (I think some of the others, at least
Silvan, might have stumbled over these errors.)
msg6508 (view) Author: jendrik Date: 2017-09-03.14:47:18
The experiment is finished:

http://ai.cs.unibas.ch/_tmp_files/seipp/issue733-v1-issue733-base-issue733-v1-compare.html

Results look good. Revision v1 can translate all tasks using either Python 2.7 or 3.5 within 3872 MiB. The
same holds for the base revision using Python 3.5. (This can be seen most quickly by inspecting the
unexplained errors, which stem from "tar" complaining on stderr about the missing "output.sas" file.)

Here is a table comparing v1 for the two Python versions. Python 3.5 uses less memory, but is usually 
slower:

http://ai.cs.unibas.ch/_tmp_files/seipp/compare-python-versions.html
msg6480 (view) Author: jendrik Date: 2017-09-01.10:34:58
The code is here: https://bitbucket.org/jendrikseipp/downward/pull-requests/74

Experiment is queued.
msg6479 (view) Author: jendrik Date: 2017-09-01.09:38:08
Nice! I'll run an experiment.
msg6478 (view) Author: malte Date: 2017-09-01.02:49:28
And here is a script to show information about the memory profiles that the
modified translate.py generates. (They are dumped to JSON files with raw data,
and this script summarizes these to give a table like the one I gave in msg6475.)
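
A minimal sketch of such a script, built on meliae.loader (sketch only; the
attached show_memory_profile.py may differ in its details and output format):

    # Summarize one or more meliae JSON dumps. Run with Python 2, since
    # the Ubuntu python-meliae package is Python 2 only.
    import sys

    from meliae import loader

    def summarize_dump(path):
        # Load the dump written by meliae's scanner and print the
        # per-type summary (count, total size, max size per type).
        om = loader.load(path)
        print(om.summarize())

    if __name__ == "__main__":
        for dump_file in sys.argv[1:]:
            print("=== %s ===" % dump_file)
            summarize_dump(dump_file)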
msg6477 (view) Author: malte Date: 2017-09-01.02:47:09
If you want to play around with the memory profiler (Ubuntu package
python-meliae), here is a modified translate.py file. Change HACK_ONLY_TAGS and
add more calls to do_memory_snapshot to affect where snapshots are taken. (It
should be fairly obvious from the diff to the regular translate.py.)

Don't run this on the grid: it creates *huge* (in the gigabytes) output files.
It probably makes more sense to use this on tasks that are not the very largest
ones. I used smaller PSR tasks to explore and only switched to #50 when I knew
which snapshot I wanted.
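
For reference, the snapshot helper boils down to something like the following
(sketch only; the attached file is authoritative, and the dump-file naming
here is just illustrative):

    from meliae import scanner

    # Set this to a set of tag names to restrict which snapshot calls take
    # effect, or leave it as None to take all snapshots.
    HACK_ONLY_TAGS = None

    def do_memory_snapshot(tag):
        # Dump all live Python objects to a JSON file; these dumps are what
        # show_memory_profile.py summarizes later. Beware: they can easily
        # grow to several gigabytes.
        if HACK_ONLY_TAGS is not None and tag not in HACK_ONLY_TAGS:
            return
        scanner.dump_all_objects("memory-snapshot-%s.json" % tag)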
msg6476 (view) Author: malte Date: 2017-09-01.02:42:58
The patch to use __slots__ is attached.

As a next step, it would be great if someone other than me could run an
experiment with and without this code, measuring translator time and memory use.
Any volunteers?

It would also be great to compare Python 2.7 and Python 3.x, for a total of 4
configurations.
msg6475 (view) Author: malte Date: 2017-09-01.02:40:50
I played around a bit with the meliae memory profiler for Python.

It's quite basic and not really documented. This seems to be the best available
information on it:
http://jam-bazaar.blogspot.de/2009/11/memory-debugging-with-meliae.html. I
mainly chose it because it is available in the standard Ubuntu repository,
though only for Python 2 (python-meliae).

Here is the memory summary for PSR-Large #50, our 2nd worst example, with a
memory snapshot taken right after the "Translating task" block in translate.py.
(I tried snapshots taken at various places within translate.main and
translate.pddl_to_sas. This place was where I saw the highest memory usage for
this task, probably because the next step is simplification, which for this task
is able to throw away lots and lots of things.)

Total 22045420 objects, 75 types, Total size = 2977.5MiB (3122177101 bytes)
 Index   Count   %       Size   % Cum     Max Kind
     0 3634140  16 1250144160  40  40     344 Atom
     1 4734669  21  531867936  17  57 7051064 list
     2 5605588  25  423745744  13  70    1792 tuple
     3  783439   3  275770528   8  79     352 PropositionalAxiom
     4  544209   2  191561568   6  85     352 SASMutexGroup
     5  798018   3  136621644   4  89     184 unicode
     6 4724225  21  113381400   3  93      24 int
     7 1095692   4   88312329   2  96    4871 str
     8     468   0   51042144   1  98 25166104 dict
     9  113884   0   40087168   1  99     352 SASAxiom
    10     110   0   16806320   0  99 16777448 set
    11    1541   0     530104   0  99     344 NegatedAtom
    12    1122   0     394944   0  99     352 SASOperator
    13    1122   0     394944   0  99     352 PropositionalAction
    14     952   0     327488   0  99     344 TypedObject
    15     103   0     294192   0  99   12624 module
    16     272   0     245888   0  99     904 type
    17    1402   0     179456   0  99     128 code
    18    1449   0     173880   0  99     120 function
    19    1057   0      84560   0  99      80 wrapper_descriptor

(I've slightly reformatted the table from meliae's output to avoid columns
running into each other.)

The top row shows us that 16% of objects in the snapshot are of class Atom, and
they add up to a size of 1250144160 bytes. So more than 1 GiB is locked up in
these atoms alone, and this is likely an underestimate of how much space they
use because the total size for everything reported (2977.5 MiB) is about 25%
lower than the actual memory usage, which is likely due to internal stuff and
losses that meliae cannot track.

Dividing Size by Count in the first row, we see that (at least according to
meliae), each Atom takes 344 bytes. This is quite a lot because atoms only have
three attributes, and the size of an atom does not include the recursive size of
its attributes, just the equivalent of three references. To see if we can reduce
this, I added a __slots__ declaration to pddl.conditions.Literal (the base class
of Atom), and at least on my computer it leads to very large memory savings. I
tested Python 2.7 and Python 3.5:

psr-large:p50-s219-n100-l3-f30.pddl:

current code,   Python 2.7: peak mem 4114668 KiB, runtime 210.68s
with __slots__, Python 2.7: peak mem 2946448 KiB, runtime 208.69s
current code,   Python 3.5: peak mem 2685192 KiB, runtime 199.65s
with __slots__, Python 3.5: peak mem 2505324 KiB, runtime 198.54s

So we see that for this instance, __slots__ solves our memory problem, at least
in the sense that we should be able to translate it in a "normal" task on our grid.

We also see that Python 3.5 does much better than Python 2.7 here and that
__slots__ has much less impact there (although it still saves roughly 180 MiB
of memory). I think this is because Python 3 has recently seen great
improvements in the dict implementation for the case of object attribute dicts.
In particular, key-sharing dicts probably reduce the object overhead hugely,
which makes __slots__ much less useful. Here is a very nice talk on the topic
that I watched recently: https://www.youtube.com/watch?v=p33CVV29OG8
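
For reference, the core of the change looks roughly like this (illustrative
sketch only; the attached issue733.patch is authoritative, and the attribute
names below are just my reading of pddl.conditions):

    class Literal(object):
        # Declaring __slots__ removes the per-instance __dict__, so each
        # atom only pays for three references instead of a full attribute
        # dict. Note that every class in the hierarchy needs a __slots__
        # declaration, otherwise instances still get a __dict__.
        __slots__ = ["predicate", "args", "hash"]

        def __init__(self, predicate, args):
            self.predicate = predicate
            self.args = tuple(args)
            self.hash = hash((self.__class__, self.predicate, self.args))

    class Atom(Literal):
        __slots__ = []

    class NegatedAtom(Literal):
        __slots__ = []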

The data also indicates that we might want to investigate using Python 3 on the
grid in the future. I'm not sure how easy this would be and if suitable modules
are available out of the box. For Python 3, the minor version number makes a
huge difference in (time and memory) performance, so we would probably want a
fairly recent version, such as Python 3.5. With Python 3.5, we wouldn't have had
a memory issue for this task in the first place, so it's probably worth an
experiment.

[Addendum]

I have now also checked the worst Satellite and Scanalyzer tasks according to
Jendrik's data. Together with the PSR-Large task above, these are the top three
memory hogs. All other tasks in the top 10 come from the same domains, so
checking these three is hopefully sufficiently representative.

Here, I didn't repeat the memory profiling with meliae, though it might be a
good idea to do this and see if there are other classes that have a huge number
of instances, as PSR is in many ways special.

What I did observe without looking too closely is that in Satellite peak memory
usage is reached slightly later, I think somewhere close to the end of
simplify.filter_unreachable_propositions. This makes sense because in Satellite
there is very little to prune here, and the filtering code itself needs memory
for its own computations.

satellite:p33-HC-pfile13.pddl:
current code,   Python 2.7: peak mem 4165552 KiB, runtime 189.63s
with __slots__, Python 2.7: peak mem 3373232 KiB, runtime 176.42s
current code,   Python 3.5: peak mem 3196556 KiB, runtime 199.98s
with __slots__, Python 3.5: peak mem 2936464 KiB, runtime 195.09s

So in this Satellite task, Python 2.7 is a bit faster than Python 3.5, but I
think Python 3.5 would still be a decent choice due to the lower memory usage.

Finally, the Scanalyzer results. Here, it looks like peak memory usage is
reached at a similar place as with Satellite, which makes sense to me.

scanalyzer-08-strips:p28.pddl:
current code,   Python 2.7: peak mem 3936680 KiB, runtime 162.37s
with __slots__, Python 2.7: peak mem 2745500 KiB, runtime 156.30s
current code,   Python 3.5: peak mem 2832972 KiB, runtime 177.85s
with __slots__, Python 3.5: peak mem 2440588 KiB, runtime 164.14s

Again, Python 3.5 is slightly slower, but more memory-efficient.
msg6469 (view) Author: malte Date: 2017-08-25.15:15:44
(I deleted one of the two copies of translator-memory.py, as there were two
identical attachments.)
msg6468 (view) Author: malte Date: 2017-08-25.14:13:15
I searched around a bit in the slurm source code, man pages and our
/etc/slurm/slurm.conf file:

1) slurm interprets "K" as 1024, "M" as 1024^2, etc.
2) /etc/slurm/slurm.conf contains these configuration entries for us:

NodeName=ase[01-24] CPUs=16 RealMemory=61964 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
PartitionName=infai Nodes=ase[01-24] Default=NO State=UP AllowGroups=infai,scicore AllowQos=infai,30min

Dividing the "RealMemory" entry 61964 by 16 gives 3872.75, so it makes sense
that 3872 MiB is our limit.
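
Spelled out (plain arithmetic on the numbers above):

    # Per-core memory budget implied by the slurm.conf entries:
    real_memory_mib = 61964   # RealMemory of ase[01-24], in MiB
    cores = 16
    print(real_memory_mib / float(cores))   # 3872.75

    # Memory requests are in whole MiB, so 16 jobs at 3872M need
    # 16 * 3872 = 61952 MiB <= 61964 MiB and fit on one node, while
    # 16 * 3873 = 61968 MiB > 61964 MiB do not, which matches Jendrik's
    # observation that 3873M drops us to 15 cores.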
msg6467 (view) Author: malte Date: 2017-08-25.13:58:42
Thanks! Do you know what "M" means in this context? Depending on context, it
could be 1024 * 1024 bytes, 1000 * 1000 bytes, or sometimes even 1000 * 1024 bytes.
msg6466 (view) Author: jendrik Date: 2017-08-25.13:43:44
A limit of 3872M results in 16 used cores, whereas a limit of 3873M only uses 15 
cores.
msg6465 (view) Author: jendrik Date: 2017-08-25.13:28:34
Yes, no other jobs were running on the grid. I'll see if I can find the exact 
limit.
msg6464 (view) Author: malte Date: 2017-08-25.13:18:51
Have you tested this while no other jobs were running on the grid? Presumably
Slurm adds up the memory requirement of all jobs on the same node, so the memory
settings for other jobs scheduled for the same machine matter. I think it would
be interesting to find out the exact limit if we can.
msg6463 (view) Author: jendrik Date: 2017-08-25.12:46:48
I have to correct my earlier comment: setting --mem-per-cpu=4G uses 15 instead of 
16 cores. On further investigation I found that a limit of 3850M leads to using 
16 cores, while 3900M leads to using 15 cores.
msg6462 (view) Author: jendrik Date: 2017-08-25.12:15:11
In an experiment setting --mem-per-cpu=4G, only one task couldn't be translated: 
satellite:p33-HC-pfile13.pddl.

However, this might only be luck, since memory usage fluctuates a bit between 
different runs and we never know when Slurm decides to kill a process.
msg6461 (view) Author: jendrik Date: 2017-08-25.12:14:37
Here are the results in KiB.
msg6460 (view) Author: jendrik Date: 2017-08-25.12:10:51
I'm attaching an experiment for measuring the peak memory usage of the 
translator. It uses a memory limit of 8 GiB.
msg6459 (view) Author: jendrik Date: 2017-08-25.12:08:26
Currently, each computer in our compute grid has 16 cores and 64 GiB memory. 
Since we want to use all cores in parallel, this means that each core should be 
allowed to use 4 GiB. The old grid engine allowed processes to temporarily use 
more than 4 GiB, but the new Slurm engine is more restrictive.

It appears that setting --mem-per-cpu=4G in fact uses all 16 cores in parallel. 
However, the translator currently needs more than 4 GiB for one task in our 
standard benchmark set. Since we'd like to be able to translate this task and 
other big tasks not part of our benchmark set, we'd like to reduce the memory 
footprint of the translator.
History
Date                 User     Action  Args
2017-09-04 16:18:12  jendrik  set     status: testing -> resolved; messages: + msg6525
2017-09-03 15:53:52  malte    set     messages: + msg6511
2017-09-03 14:47:19  jendrik  set     messages: + msg6508
2017-09-01 10:34:58  jendrik  set     assignedto: jendrik; messages: + msg6480
2017-09-01 09:38:08  jendrik  set     status: chatting -> testing; messages: + msg6479
2017-09-01 02:49:28  malte    set     files: + show_memory_profile.py; messages: + msg6478
2017-09-01 02:47:09  malte    set     files: + translate.py.with.snapshots; messages: + msg6477
2017-09-01 02:42:58  malte    set     files: + issue733.patch; messages: + msg6476
2017-09-01 02:40:50  malte    set     messages: + msg6475
2017-08-26 05:31:34  florian  set     nosy: + florian
2017-08-25 15:15:44  malte    set     messages: + msg6469
2017-08-25 15:15:18  malte    set     files: - translator-memory.py
2017-08-25 14:13:15  malte    set     messages: + msg6468
2017-08-25 13:58:42  malte    set     messages: + msg6467
2017-08-25 13:43:44  jendrik  set     messages: + msg6466
2017-08-25 13:28:34  jendrik  set     messages: + msg6465
2017-08-25 13:18:51  malte    set     messages: + msg6464
2017-08-25 12:46:48  jendrik  set     messages: + msg6463
2017-08-25 12:15:11  jendrik  set     messages: + msg6462
2017-08-25 12:14:37  jendrik  set     files: + translator-memory-usage.txt; messages: + msg6461
2017-08-25 12:10:51  jendrik  set     files: + translator-memory.py; status: unread -> chatting; messages: + msg6460
2017-08-25 12:08:26  jendrik  create