fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref #4362

Open
tp5uiuc wants to merge 1 commit into pytorch:main from tp5uiuc:fix/runtime-cache-del-shutdown

Conversation

@tp5uiuc tp5uiuc commented Jun 24, 2026

Collaborator

Summary

  • Engine-implicit RuntimeCache handles (those constructed by an engine
    when RuntimeSettings(runtime_cache="/path") is used) previously relied
    solely on __del__ to autosave the kernel cache. When the engine
    survived until interpreter exit (typical for inference servers),
    __del__ fired during shutdown — by which point Python had torn
    down sys.meta_path, breaking the torchbind attribute access in
    self._handle.path and the lazy filelock import inside save().
    The resulting ImportError escaped __del__ and printed
    Exception ignored in: <function RuntimeCache.__del__> once per
    surviving handle (🐛 [Bug] Serialization fails in Runtime Cache #4359).
  • Move the engine-implicit autosave anchor to atexit, which fires
    before module teardown. __del__ remains as the mid-program GC path.
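
The ordering claim above can be checked in isolation. The following is a minimal sketch, not taken from the PR: it launches a fresh interpreter whose `atexit` hook inspects `sys.meta_path` and performs a lazy import. The names `CHILD`, `hook`, and `run_child` are hypothetical, and `colorsys` is only an arbitrary stdlib module that has not been imported earlier in the child process.

```python
import subprocess
import sys
import textwrap

# Script executed in a child interpreter. The hook runs at interpreter exit.
CHILD = textwrap.dedent(
    """
    import atexit
    import sys

    def hook():
        # The import machinery is still intact while atexit callbacks run.
        print("meta_path set:", sys.meta_path is not None)
        import colorsys  # lazy import of a module not loaded so far
        print("lazy import ok:", colorsys.__name__)

    atexit.register(hook)
    """
)


def run_child():
    """Run the child script in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `run_child()` should show both lines printed by the hook, which is the property the autosave path relies on.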

Design notes

  • atexit.register(partial(_autosave_at_exit, weakref.ref(self))) keeps
    the registration non-owning: a handle that dies mid-program is still
    collected normally; the atexit hook later sees a dead weakref and
    no-ops. A bound method would have defeated the weakref, hence the
    module-level free function + partial.
  • The autosave logic lives once on the class as _autosave_if_enabled
    (called from both __del__ and the atexit hook). It flips
    autosave_on_del off before saving so whichever path runs first wins
    and the other no-ops — no double-save, no double-leak risk.
  • __del__ calls atexit.unregister(self._atexit_token) so the
    registry size stays O(live handles) rather than O(all handles ever
    created), which matters for long-running processes that churn engines.
  • self._atexit_token = None is the first statement in __init__, so
    __del__ can always read it safely even if a later line of __init__
    raises (partial-init defense).
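
The design notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `SketchCache` is a hypothetical stand-in for `RuntimeCache`, `save()` only counts calls instead of serializing anything, and `save_count` exists purely for inspection. The names `_autosave_at_exit`, `_autosave_if_enabled`, `_atexit_token`, and `autosave_on_del` follow the description.

```python
import atexit
import weakref
from functools import partial


def _autosave_at_exit(ref):
    """Module-level atexit hook: it holds only a weakref, never the handle."""
    cache = ref()
    if cache is not None:  # a dead weakref means the handle was already collected
        cache._autosave_if_enabled()


class SketchCache:
    """Hypothetical stand-in for RuntimeCache (not the torch_tensorrt class)."""

    def __init__(self, path, autosave_on_del=True):
        self._atexit_token = None  # first statement, so __del__ can always read it
        self.path = path
        self.autosave_on_del = autosave_on_del
        self.save_count = 0
        if autosave_on_del:
            # Keep the partial so __del__ can unregister this exact callable.
            self._atexit_token = partial(_autosave_at_exit, weakref.ref(self))
            atexit.register(self._atexit_token)

    def save(self):
        self.save_count += 1  # placeholder for the real serialization

    def _autosave_if_enabled(self):
        if not self.autosave_on_del:
            return
        self.autosave_on_del = False  # flip first so the other path no-ops
        self.save()

    def __del__(self):
        try:
            token = getattr(self, "_atexit_token", None)
            if token is not None:
                atexit.unregister(token)  # keep the registry at O(live handles)
            self._autosave_if_enabled()
        except Exception:
            pass  # never let shutdown-time failures escape __del__
```

In this sketch, calling the stored token by hand simulates the interpreter running the hook: the first call saves once, and any later call (or `__del__`) sees `autosave_on_del` already off and does nothing.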

Refs #4359 — the topology crash that surfaced this issue is fixed in a
separate PR; this PR addresses the autosave reliability + shutdown
noise.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Test plan

Added under TestRuntimeCacheAutosave in
tests/py/dynamo/runtime/test_000_runtime_cache.py:

  • test_del_swallows_shutdown_import_error_on_path — monkey-patches
    _handle.path to raise the shutdown ImportError; asserts via
    sys.unraisablehook that nothing leaks.
  • test_atexit_hook_saves_via_weakref — exercises the helper
    directly; verifies it routes through _autosave_if_enabled, saves,
    and flips autosave_on_del off.
  • test_atexit_hook_no_op_on_dead_weakref — dead weakref → no-op,
    no exception.
  • test_atexit_token_unregistered_after_del: atexit.unregister
    is spied to confirm __del__ cleaned up the registration with the
    specific partial token.
  • Full test_000_runtime_cache.py passes locally on cpp-rt (17
    passed, 6 unrelated skips on the py-runtime whitebox tests).
  • pre-commit run clean on touched files.
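
For readers unfamiliar with the `sys.unraisablehook` technique used in the first test, a minimal sketch of the idea follows. It is not the test from this PR: `NoisyHandle`, `GuardedCache`, and `count_unraisable` are hypothetical names, and the error message string is only illustrative.

```python
import gc
import sys


class NoisyHandle:
    """Hypothetical handle whose path lookup fails as it might during shutdown."""

    @property
    def path(self):
        raise ImportError("sys.meta_path is None, Python is likely shutting down")


class GuardedCache:
    """Hypothetical cache; `guarded` toggles the try/except around the __del__ body."""

    def __init__(self, guarded=True):
        self._handle = NoisyHandle()
        self._guarded = guarded

    def __del__(self):
        if not self._guarded:
            self._handle.path  # raises; CPython reports it via sys.unraisablehook
            return
        try:
            self._handle.path
        except Exception:
            pass  # swallowed, so nothing reaches the hook


def count_unraisable(make_obj):
    """Create and destroy an object; return how many unraisable errors were reported."""
    seen = []
    old_hook = sys.unraisablehook
    # Record only the exception type to avoid keeping the dying object alive.
    sys.unraisablehook = lambda args: seen.append(args.exc_type)
    try:
        obj = make_obj()
        del obj
        gc.collect()
    finally:
        sys.unraisablehook = old_hook
    return len(seen)
```

With the guard in place the count stays at zero; without it, the `ImportError` raised inside `__del__` is reported once through the hook.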

@meta-cla meta-cla Bot added the cla signed label Jun 24, 2026
@github-actions github-actions Bot added the component: tests (Issues re: Tests) and component: api [Python] (Issues re: Python API) labels Jun 24, 2026
@github-actions github-actions Bot requested a review from lanluo-nvidia June 24, 2026 17:21
@tp5uiuc tp5uiuc self-assigned this Jun 24, 2026
@tp5uiuc tp5uiuc marked this pull request as draft June 24, 2026 17:22
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from 6d02dad to a0a67d7 Compare June 25, 2026 00:53
@tp5uiuc tp5uiuc changed the title from "fix(runtime): silence ImportError from RuntimeCache.__del__" to "fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref" Jun 25, 2026
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from a0a67d7 to a3770be Compare June 25, 2026 01:52
The previous `__del__`-only autosave path silently lost cache updates when
the engine survived until interpreter exit (typical for inference servers).
Python tears down `sys.meta_path` early in shutdown; the torchbind
attribute access in `self._handle.path` and the lazy `filelock` import
inside `save()` then raise `ImportError: sys.meta_path is None`, which
escaped `__del__` and surfaced as a noisy `Exception ignored in __del__`
once per surviving handle.

`atexit` callbacks run *before* module teardown, so the torchbind path
and lazy imports still resolve there. Register an `atexit` hook from
`__init__` whenever `autosave_on_del=True`. The hook closes over a
`weakref.ref(self)` so it doesn't pin the handle alive: a handle that
dies mid-program still goes through `__del__`, and the atexit hook later
sees a dead weakref and no-ops.

Other design points worth calling out:

* `_autosave_at_exit` is a module-level helper, not a bound method. A
  bound method captures `self` via `__self__`, which would defeat the
  weakref. The free function lets the closure carry only the weakref.

* Both `__del__` and the atexit hook flip `autosave_on_del` off before
  saving so whichever path runs first wins and the other no-ops -- no
  double-save, no double-leak risk.

* `__del__` unregisters its atexit token. Without this, a long-running
  process that churns engine-implicit handles (model swaps, A/B
  rollouts) accumulates dead atexit entries -- small per entry but
  unbounded.

* The `try` in `__del__` still wraps the whole body so any residual
  attribute-access failure during late-shutdown corner cases is swallowed
  rather than leaking to `sys.unraisablehook`.
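
To make the bound-method point concrete, here is a small sketch (an illustration, not code from this commit) comparing the two registration styles. `Handle`, `flush`, `_flush_at_exit`, and the two helper functions are hypothetical names.

```python
import atexit
import gc
import weakref
from functools import partial


class Handle:
    """Hypothetical stand-in for an engine-implicit cache handle."""

    def flush(self):
        pass


def _flush_at_exit(ref):
    obj = ref()
    if obj is not None:
        obj.flush()


def pinned_by_bound_method():
    """Register a bound method, drop the handle, and report whether it survived."""
    handle = Handle()
    probe = weakref.ref(handle)
    atexit.register(handle.flush)  # the registry owns the method, which owns the handle
    del handle
    gc.collect()
    alive = probe() is not None
    if alive:
        atexit.unregister(probe().flush)  # an equal bound method removes the entry
    return alive


def pinned_by_weakref_partial():
    """Register a free function plus weakref, drop the handle, and report survival."""
    handle = Handle()
    probe = weakref.ref(handle)
    atexit.register(partial(_flush_at_exit, weakref.ref(handle)))
    del handle
    gc.collect()
    return probe() is not None  # the leftover entry no-ops at exit
```

The first helper reports that the handle is still alive after `del`, because the atexit registry reaches it through `__self__`; the second reports that it was collected.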

Tests added in `TestRuntimeCacheAutosave`:
- `test_del_swallows_shutdown_import_error_on_path`: monkey-patches
  `_handle.path` to raise the shutdown `ImportError`; asserts via
  `sys.unraisablehook` that nothing leaks.
- `test_atexit_hook_saves_via_weakref`: exercises the helper directly,
  verifies it saves and flips `autosave_on_del`.
- `test_atexit_hook_no_op_on_dead_weakref`: dead weakref => no-op, no
  exception.
- `test_atexit_token_unregistered_after_del`: `atexit.unregister` is
  spied to confirm `__del__` cleaned up.

Refs pytorch#4359
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from a3770be to eacb80e Compare June 25, 2026 02:10
@tp5uiuc tp5uiuc marked this pull request as ready for review June 25, 2026 02:12
@tp5uiuc tp5uiuc requested a review from cehongwang June 25, 2026 02:12