Make recovery use fine-grained eviction and remove (most) sync forms #1892

Merged: TedHartMS merged 11 commits into main from tedhar/recov-ha-nosync on Jun 23, 2026

@TedHartMS

This pull request refactors the cluster and database recovery logic to be asynchronous, replacing most synchronous recovery methods with async counterparts throughout the codebase. This improves scalability and responsiveness during cluster and database startup, because recovery operations no longer block threads unnecessarily. The changes touch core interfaces, implementations, and usage sites, updating method signatures and internal logic to use ValueTask and async/await patterns.

Key changes include:

Asynchronous Recovery Refactor

  • Changed all recovery methods such as Recover, RecoverCheckpoint, and RecoverAOF in cluster and database manager interfaces and implementations to their asynchronous equivalents (RecoverAsync, RecoverCheckpointAsync, and RecoverAOFAsync), updating signatures and call sites accordingly.

  • Updated recovery logic in ReplicationManager and ReplicaDiskbasedSync to call asynchronous recovery methods and use await where appropriate, including transitioning methods like Recover, RecoverCheckpointAndAOF, and related calls to async.
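The shape of this conversion can be sketched as follows. This is a minimal illustration only: `IDatabaseManagerSketch`, `DatabaseManagerSketch`, and `StartupSketch` are hypothetical stand-ins, and the parameter lists and return types are assumptions rather than the signatures used in this PR.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a database-manager interface; real signatures may differ.
public interface IDatabaseManagerSketch
{
    // Before (sync forms): void RecoverCheckpoint(); long RecoverAOF();
    ValueTask RecoverCheckpointAsync(CancellationToken token = default);
    ValueTask<long> RecoverAOFAsync(CancellationToken token = default);
}

public sealed class DatabaseManagerSketch : IDatabaseManagerSketch
{
    public bool CheckpointRecovered { get; private set; }

    public async ValueTask RecoverCheckpointAsync(CancellationToken token = default)
    {
        // Stand-in for awaiting device reads instead of blocking a thread on them.
        await Task.Delay(10, token).ConfigureAwait(false);
        CheckpointRecovered = true;
    }

    public async ValueTask<long> RecoverAOFAsync(CancellationToken token = default)
    {
        await Task.Yield();
        token.ThrowIfCancellationRequested();
        return 42; // illustrative "replayed until" address, not a real value
    }
}

public static class StartupSketch
{
    // The caller becomes async as well, so the whole chain awaits instead of blocking.
    public static async ValueTask<long> RecoverAsync(
        IDatabaseManagerSketch db, CancellationToken token = default)
    {
        await db.RecoverCheckpointAsync(token).ConfigureAwait(false);
        return await db.RecoverAOFAsync(token).ConfigureAwait(false);
    }
}
```

Each `ValueTask` here is awaited exactly once, which is the usage pattern `ValueTask` is designed for.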

Interface and API Consistency

  • Modified IClusterProvider, IDatabaseManager, and related interfaces to use async recovery methods, ensuring consistency across the codebase and enabling async recovery flows from top-level startup routines down to storage engines.

Synchronous Startup Compatibility

  • Where recovery must still be performed synchronously (such as during server startup), the code uses .AsTask().GetAwaiter().GetResult(), with the corresponding warnings suppressed, to maintain compatibility while transitioning to async APIs.
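A minimal sketch of such a sync bridge is shown below. `SyncBridgeSketch` and its methods are hypothetical, and the suppressed analyzer ID (VSTHRD002, from the Microsoft.VisualStudio.Threading analyzers) is an assumption; the source does not say which warning the PR actually suppresses.

```csharp
using System;
using System.Threading.Tasks;

public static class SyncBridgeSketch
{
    // Hypothetical async recovery entry point.
    public static async ValueTask<int> RecoverAsync()
    {
        await Task.Delay(10).ConfigureAwait(false);
        return 7; // illustrative result only
    }

    // Synchronous wrapper for callers that cannot be made async yet.
    public static int RecoverBlocking()
    {
        // A ValueTask that has not completed should not be blocked on directly,
        // so convert it to a Task first. Task.GetAwaiter().GetResult() then
        // blocks and rethrows the original exception rather than wrapping it
        // in an AggregateException.
#pragma warning disable VSTHRD002 // assumed analyzer ID, see lead-in
        return RecoverAsync().AsTask().GetAwaiter().GetResult();
#pragma warning restore VSTHRD002
    }
}
```

Blocking like this ties up the calling thread for the duration of recovery, so it suits one-time startup paths better than request-handling paths.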

Internal Implementation Updates

  • Refactored internal recovery implementations for SingleLog, ShardedLog, and related storage classes to provide async recovery methods, and updated their usage throughout the codebase.

  • Modified Recovery eviction logic to be two-pass:

    • Pass 1 loads pages without loading objects, and evicts pages to keep memory usage within budget.
    • If there are objects, pass 2 loads them from the high page addresses down, evicting pages as needed to stay within the memory budget. If no more whole pages can be evicted, earlier records on the same page are evicted instead, and headAddress advances on a per-record basis.
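The two passes can be modeled with a small simulation. This is one plausible reading of the description above, not the Tsavorite implementation: `RecoverySketch`, its fixed per-page cost, the four-records-per-page layout, and the stop-at-first-record-that-does-not-fit rule are all assumptions made for illustration.

```csharp
using System;

public sealed class RecoverySketch
{
    public const int RecordsPerPage = 4;          // assumed layout
    private readonly long[] objectSizes;          // objectSizes[address] = object bytes for that record
    private readonly long pageBytes;              // assumed fixed in-memory cost of one page
    private readonly long budget;

    public long Used { get; private set; }        // bytes currently held in memory
    public long HeadAddress { get; private set; } // lowest address still in memory
    public long TailAddress => objectSizes.Length;
    private int PageCount => (objectSizes.Length + RecordsPerPage - 1) / RecordsPerPage;

    public RecoverySketch(long[] objectSizes, long pageBytes, long budget)
    {
        this.objectSizes = objectSizes;
        this.pageBytes = pageBytes;
        this.budget = budget;
    }

    private void EvictHeadPage()
    {
        Used -= pageBytes;
        HeadAddress = (HeadAddress / RecordsPerPage + 1) * RecordsPerPage;
    }

    // Pass 1: bring in pages without their objects, evicting the lowest pages when over budget.
    public void Pass1LoadPages()
    {
        for (int page = 0; page < PageCount; page++)
        {
            Used += pageBytes;
            while (Used > budget && HeadAddress / RecordsPerPage < page)
                EvictHeadPage();
        }
    }

    // Pass 2: load objects from the highest address down.
    public void Pass2LoadObjects()
    {
        for (long addr = TailAddress - 1; addr >= HeadAddress; addr--)
        {
            long size = objectSizes[addr];
            long page = addr / RecordsPerPage;

            // Prefer evicting whole pages below the page being loaded.
            while (Used + size > budget && HeadAddress / RecordsPerPage < page)
                EvictHeadPage();

            if (Used + size > budget)
            {
                // No whole page left to evict: drop this record and the earlier
                // records on the same page, so the head lands on a record boundary.
                HeadAddress = addr + 1;
                return;
            }
            Used += size;
        }
    }
}
```

With twelve records of 30 bytes each, a page cost of 100, and a budget of 350, pass 1 keeps all three pages (300 bytes). Pass 2 then evicts page 0 to make room, loads the objects at addresses 11 down to 7, and stops with `HeadAddress` at 7 and `Used` at 350, which is in the middle of page 1.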

These changes collectively modernize the recovery path, improve non-blocking behavior, and lay the groundwork for further async enhancements in cluster and database management.

Copilot AI review requested due to automatic review settings June 19, 2026 22:23

Copilot AI left a comment

Pull request overview

This pull request refactors Garnet/Tsavorite recovery to be predominantly asynchronous (moving call chains to RecoverAsync / ValueTask) and updates Tsavorite’s snapshot recovery to support fine-grained, budget-aware eviction with a two-pass “read pages first, load objects later” flow.

Changes:

  • Replaced many synchronous recovery entry points (cluster, database manager, AOF, TsavoriteLog, TsavoriteKV) with async counterparts, updating call sites and tests accordingly.
  • Implemented/extended recovery-time eviction and deferred object loading for snapshot recovery, including object-log byte copying from snapshot object-log into the main object-log to make pages evictable under memory pressure.
  • Added/updated tests to cover async recovery and snapshot recovery + eviction scenarios.

Reviewed changes

Copilot reviewed 53 out of 53 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/standalone/Garnet.test/RespConfigTests.cs | Updates a size-tracker precondition in a config/eviction-related test. |
| test/standalone/Garnet.test.collections/GarnetObjectTests.cs | Switches store recovery in collection tests to RecoverAsync. |
| libs/storage/Tsavorite/cs/test/TestUtils.cs | Removes CompletionSyncMode enum used by tests. |
| libs/storage/Tsavorite/cs/test/test.session.context/UnsafeContextTests.cs | Removes sync completion mode path; uses async completion APIs in unsafe-context tests. |
| libs/storage/Tsavorite/cs/test/test.session.context/TransactionalUnsafeContextTests.cs | Removes sync completion mode path; uses async completion APIs in transactional unsafe-context tests. |
| libs/storage/Tsavorite/cs/test/test.recovery/SimpleRecoveryTest.cs | Updates recovery/checkpoint tests to async-only completion and recovery. |
| libs/storage/Tsavorite/cs/test/test.recovery/RecoveryTests.cs | Updates recovery tests to async-only recovery/checkpointing paths. |
| libs/storage/Tsavorite/cs/test/test.recovery/RecoveryCheckTests.cs | Updates multiple recovery-check tests to async flows and adds ConfigureAwait(false) in some awaits. |
| libs/storage/Tsavorite/cs/test/test.recovery/ObjectRecoveryTest3.cs | Updates object recovery test to async-only recovery. |
| libs/storage/Tsavorite/cs/test/test.recovery/ObjectRecoveryTest2.cs | Updates object recovery test to async-only recovery/checkpoint completion. |
| libs/storage/Tsavorite/cs/test/test.recovery/ObjectRecoveryTest.cs | Updates object recovery test to async-only recovery and removes unused static import. |
| libs/storage/Tsavorite/cs/test/test.recovery/ObjectRecoverySnapshotEvictionTests.cs | New tests exercising snapshot deferred object load + eviction, plus compact+truncate after recovery. |
| libs/storage/Tsavorite/cs/test/test.recovery/LargeObjectTests.cs | Updates large-object recovery/checkpoint tests to RecoverAsync / async completion. |
| libs/storage/Tsavorite/cs/test/test.recovery/ComponentRecoveryTests.cs | Converts component recovery tests to async (RecoverAsync APIs with cancellation tokens). |
| libs/storage/Tsavorite/cs/test/test.recordops/RecordLifecycleTests.cs | Minor test message punctuation tweak. |
| libs/storage/Tsavorite/cs/test/test.hlog/LogTests.cs | Converts TsavoriteLog manual commit test to async RecoverAsync. |
| libs/storage/Tsavorite/cs/test/test.hlog/LogRecoverReadOnlyTests.cs | Removes sync/async toggle; uses async-only RecoverReadOnlyAsync and async log creation. |
| libs/storage/Tsavorite/cs/test/test.hlog/LogFastCommitTests.cs | Converts fast-commit test to async RecoverAsync. |
| libs/storage/Tsavorite/cs/test/test.hlog/FlakyDeviceTests.cs | Removes trailing whitespace/blank line. |
| libs/storage/Tsavorite/cs/test/SharedDirectoryTests.cs | Removes sync/async toggle; uses async-only store recovery. |
| libs/storage/Tsavorite/cs/test/MiscTests.cs | Converts a recovery test to async RecoverAsync. |
| libs/storage/Tsavorite/cs/src/core/Utilities/PageAsyncResultTypes.cs | Adds RecoveryPhase and extends async page result types with recovery-phase and snapshot-copy metadata. |
| libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs | Introduces async RecoverAsync, removes sync RecoverReadOnly, refactors restore helpers to async; updates internal state tracking. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs | Removes sync Recover overloads; relies on async recovery and sync-bridge (GetResult) in a few legacy/compat paths. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/Recovery.cs | Major refactor of recovery driver to async; implements two-pass recovery with page trimming and deferred object loading with eviction support. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/IndexRecovery.cs | Removes sync fuzzy-index recovery/wait helpers; leaves async-only recovery APIs. |
| libs/storage/Tsavorite/cs/src/core/Index/Common/LogSizeTracker.cs | Renames/adjusts size-tracker APIs (IsOverBudget, RemainingBudget), updates eviction range logic and page scanning. |
| libs/storage/Tsavorite/cs/src/core/Index/Common/LogSettings.cs | Adds kMinPageCount constant used for min memory sizing. |
| libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocator.cs | Adds new interface methods (e.g., GetPageObjectIdMap) and updated eviction signature. |
| libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocator.cs | Adds new interface methods and updated eviction signature (no-op for record-eviction range). |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectSerialization/ObjectLogWriter.cs | Adds recovery-time snapshot object-byte copy helper (CopyRecoveredObjectBytes). |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectIdMap.cs | Adds IsEmpty helper used by eviction/recovery code paths. |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | Extends per-record eviction API with recovery awareness; implements snapshot object-log copy during recovery flush and demand-loaded reader setup. |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocator.cs | Adds GetPageObjectIdMap plumbing and updated eviction signature. |
| libs/storage/Tsavorite/cs/src/core/Allocator/MallocFixedPageSize.cs | Removes sync recovery/wait helpers; keeps async recovery API. |
| libs/storage/Tsavorite/cs/src/core/Allocator/LogRecord.cs | Adds RepointObjectLogPosition to support snapshot->main object-log copy during recovery flush. |
| libs/storage/Tsavorite/cs/src/core/Allocator/IAllocator.cs | Extends allocator interface with GetPageObjectIdMap and EvictRecordsInRange(..., isRecovery). |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Adds recovery/object-load helpers, refactors page allocation helper, extends recovery read/flush APIs with recovery phase and snapshot copy metadata. |
| libs/storage/Tsavorite/cs/benchmark/YCSB.benchmark/TestLoader.cs | Updates recovery call to RecoverAsync().GetResult() in benchmark loader. |
| libs/server/StoreWrapper.cs | Converts recovery flow to RecoverAsync, and makes checkpoint/AOF recovery async. |
| libs/server/Providers/GarnetProvider.cs | Exposes async recovery API from provider. |
| libs/server/Databases/SingleDatabaseManager.cs | Converts checkpoint and AOF recovery to async and updates internal recovery routines accordingly. |
| libs/server/Databases/MultiDatabaseManager.cs | Converts checkpoint and AOF recovery to async for multi-db mode, updating error messaging. |
| libs/server/Databases/IDatabaseManager.cs | Changes recovery APIs to async (RecoverCheckpointAsync, RecoverAOFAsync). |
| libs/server/Databases/DatabaseManagerBase.cs | Refactors shared recovery helpers to async (RecoverDatabaseCheckpointAsync, RecoverDatabaseAOFAsync). |
| libs/server/Cluster/IClusterProvider.cs | Changes cluster recovery API to RecoverAsync. |
| libs/server/AOF/SingleLog.cs | Converts AOF log recovery to async. |
| libs/server/AOF/ShardedLog.cs | Converts sharded AOF recovery to async (awaits per-sublog recovery). |
| libs/server/AOF/GarnetLog.cs | Converts GarnetLog recovery to async, delegating to single/sharded logs. |
| libs/host/GarnetServer.cs | Bridges server startup to async recovery with .GetAwaiter().GetResult() and warning suppression. |
| libs/cluster/Server/Replication/ReplicationManager.cs | Converts replication recovery driver to async and updates logging. |
| libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs | Awaits async AOF recovery; sync-bridges async checkpoint recovery in a synchronous RESP path with warning suppression. |
| libs/cluster/Server/ClusterProvider.cs | Switches cluster recovery to async delegation to replication manager. |

Comment thread libs/storage/Tsavorite/cs/src/core/Index/Common/LogSizeTracker.cs Outdated
Comment thread libs/storage/Tsavorite/cs/src/core/Index/Recovery/Recovery.cs Outdated
@TedHartMS TedHartMS requested a review from badrishc June 23, 2026 01:25
@TedHartMS TedHartMS merged commit 9f1b6c6 into main Jun 23, 2026
399 of 401 checks passed
@TedHartMS TedHartMS deleted the tedhar/recov-ha-nosync branch June 23, 2026 19:00