fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362
Open
tp5uiuc wants to merge 1 commit into
Open
fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362tp5uiuc wants to merge 1 commit into
tp5uiuc wants to merge 1 commit into
Conversation
6d02dad to
a0a67d7
Compare
a0a67d7 to
a3770be
Compare
The previous `__del__`-only autosave path silently lost cache updates when the engine survived until interpreter exit (typical for inference servers). Python tears down `sys.meta_path` early in shutdown; the torchbind attribute access in `self._handle.path` and the lazy `filelock` import inside `save()` then raise `ImportError: sys.meta_path is None`, which escaped `__del__` and surfaced as a noisy `Exception ignored in __del__` once per surviving handle. `atexit` callbacks run *before* module teardown, so the torchbind path and lazy imports still resolve there. Register an `atexit` hook from `__init__` whenever `autosave_on_del=True`. The hook closes over a `weakref.ref(self)` so it doesn't pin the handle alive: a handle that dies mid-program still goes through `__del__`, and the atexit hook later sees a dead weakref and no-ops. Other design points worth calling out: * `_autosave_at_exit` is a module-level helper, not a bound method. A bound method captures `self` via `__self__`, which would defeat the weakref. The free function lets the closure carry only the weakref. * Both `__del__` and the atexit hook flip `autosave_on_del` off before saving so whichever path runs first wins and the other no-ops -- no double-save, no double-leak risk. * `__del__` unregisters its atexit token. Without this, a long-running process that churns engine-implicit handles (model swaps, A/B rollouts) accumulates dead atexit entries -- small per entry but unbounded. * The `try` in `__del__` still wraps the whole body so any residual attribute-access failure during late-shutdown corner cases is swallowed rather than leaking to `sys.unraisablehook`. Tests added in `TestRuntimeCacheAutosave`: - `test_del_swallows_shutdown_import_error_on_path`: monkey-patches `_handle.path` to raise the shutdown `ImportError`; asserts via `sys.unraisablehook` that nothing leaks. - `test_atexit_hook_saves_via_weakref`: exercises the helper directly, verifies it saves and flips `autosave_on_del`. - `test_atexit_hook_no_op_on_dead_weakref`: dead weakref => no-op, no exception. - `test_atexit_token_unregistered_after_del`: `atexit.unregister` is spied to confirm `__del__` cleaned up. Refs pytorch#4359
a3770be to
eacb80e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
RuntimeCachehandles (those constructed by an enginewhen
RuntimeSettings(runtime_cache="/path")is used) previously reliedsolely on
__del__to autosave the kernel cache. When the enginesurvived until interpreter exit (typical for inference servers),
__del__fired during shutdown — by which point Python had torndown
sys.meta_path, breaking the torchbind attribute access inself._handle.pathand the lazyfilelockimport insidesave().The resulting
ImportErrorescaped__del__and printedException ignored in: <function RuntimeCache.__del__>once persurviving handle (🐛 [Bug] Serialization fails in Runtime Cache #4359).
atexit, which firesbefore module teardown.
__del__remains as the mid-program GC path.Design notes
atexit.register(partial(_autosave_at_exit, weakref.ref(self)))keepsthe registration non-owning: a handle that dies mid-program is still
collected normally; the atexit hook later sees a dead weakref and
no-ops. A bound method would have defeated the weakref, hence the
module-level free function +
partial._autosave_if_enabled(called from both
__del__and the atexit hook). It flipsautosave_on_deloff before saving so whichever path runs first winsand the other no-ops — no double-save, no double-leak risk.
__del__callsatexit.unregister(self._atexit_token)so theregistry size stays
O(live handles)rather thanO(all handles ever created)— matters for long-running processes that churn engines.self._atexit_token = Noneis the first statement in__init__, so__del__can always read it safely even if a later line of__init__raises (partial-init defense).
Refs #4359 — the topology crash that surfaced this issue is fixed in a
separate PR; this PR addresses the autosave reliability + shutdown
noise.
Type of change
Test plan
Added under
TestRuntimeCacheAutosaveintests/py/dynamo/runtime/test_000_runtime_cache.py:test_del_swallows_shutdown_import_error_on_path— monkey-patches_handle.pathto raise the shutdownImportError; asserts viasys.unraisablehookthat nothing leaks.test_atexit_hook_saves_via_weakref— exercises the helperdirectly; verifies it routes through
_autosave_if_enabled, saves,and flips
autosave_on_deloff.test_atexit_hook_no_op_on_dead_weakref— dead weakref → no-op,no exception.
test_atexit_token_unregistered_after_del—atexit.unregisteris spied to confirm
__del__cleaned up the registration with thespecific
partialtoken.test_000_runtime_cache.pypasses locally on cpp-rt (17passed, 6 unrelated skips on the py-runtime whitebox tests).
pre-commit runclean on touched files.