feat(coding-agent): cache successful document conversions#3264
feat(coding-agent): cache successful document conversions#3264wolfiesch wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a transparent, bounded, content-addressed cache for successful Markit document conversions in the coding-agent, plus a new XDG-aware cache directory helper in @oh-my-pi/pi-utils, so repeated reads of unchanged documents reuse previously converted markdown.
Changes:
- Introduces a content-hash + normalized-extension cache layer for Markit conversions (buffer + file wrappers), with best-effort bounded pruning.
- Adds
getDocumentConversionCacheDir()to@oh-my-pi/pi-utils/dirs(XDG-aware) and documents it in the utils changelog. - Adds/updates tests to validate cache behavior (hit/miss/skip/invalidation) and reuse across full + selector PDF reads.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/utils/test/dirs-cache.test.ts | Tests XDG vs custom agent-dir resolution for the new document conversion cache directory helper. |
| packages/utils/src/dirs.ts | Adds getDocumentConversionCacheDir() using the existing agentSubdir(..., "cache") XDG-aware mechanism. |
| packages/utils/CHANGELOG.md | Documents the new dirs helper in utils’ Unreleased section. |
| packages/coding-agent/test/utils/markit-cache.test.ts | Adds unit coverage for Markit conversion caching semantics (hit/miss/skip, invalidation). |
| packages/coding-agent/test/tools/read-pdf-line-range.test.ts | Adds regression coverage ensuring cached markdown is reused across repeated reads of an unchanged PDF. |
| packages/coding-agent/src/utils/markit.ts | Wires caching into convertFileWithMarkit / convertBufferWithMarkit, skipping cache for imageDir conversions. |
| packages/coding-agent/src/utils/markit-cache.ts | Implements the on-disk JSON cache entries, keying, and bounded pruning. |
| packages/coding-agent/package.json | Exposes ./markit exports for external/test imports. |
| packages/coding-agent/CHANGELOG.md | Notes the user-facing behavior change (cached successful document conversions). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| function parseCacheEntry(raw: string): MarkitConversionCacheEntry | null { | ||
| const parsed: unknown = JSON.parse(raw); | ||
| if (typeof parsed !== "object" || parsed === null) return null; | ||
| if (!("version" in parsed) || parsed.version !== MARKIT_CONVERSION_CACHE_VERSION) return null; | ||
| if (!("content" in parsed) || typeof parsed.content !== "string") return null; | ||
| return { version: MARKIT_CONVERSION_CACHE_VERSION, content: parsed.content }; | ||
| } |
| for (const name of names) { | ||
| if (!name.endsWith(".json")) continue; | ||
| const entryPath = path.join(cacheDir, name); | ||
| try { | ||
| const stat = await fs.stat(entryPath); |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 445153f3bc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| export async function readMarkitConversionCache( | ||
| key: string, | ||
| ): Promise<{ status: "hit"; content: string } | { status: "miss" }> { |
There was a problem hiding this comment.
Use Bun APIs for cache file reads
The repo’s /workspace/oh-my-pi/AGENTS.md says file reads/writes should use Bun.file()/Bun.write() and reserves node:fs/promises for directory operations. These cache entry reads (and the matching writes below) are plain file I/O, so this new utility violates that documented convention; please switch the cache payload I/O to Bun APIs while keeping fs for mkdir/rm/readdir/stat/rename.
Useful? React with 👍 / 👎.
roboomp
left a comment
There was a problem hiding this comment.
Thanks @wolfiesch — nicely scoped, well-tested addition. Caching is gated to successful, non-imageDir conversions, the key (content SHA-256 + normalised extension) is right, and the abort-safety + tmp-then-rename atomicity are correct. Tests cover hit/miss/skip, content invalidation, and the imageDir bypass.
Ranked P2: this turns on a new ~256 MiB on-disk cache by default for every markit caller (read tool, @file mentions, fetched-doc buffers), so the maintainer call is whether that default-on disk footprint is OK without a settings toggle to disable it.
Headline findings (inline):
MARKIT_CONVERSION_CACHE_VERSIONis a hand-bumped constant; nothing ties it to the markit converters whose output it caches across package upgrades.pruneMarkitConversionCacheonly sees.json, so.tmporphans from a SIGKILL betweenwriteFileandrenameaccumulate outside the 256 MiB accounting.- Eviction is FIFO by mtime, not LRU; readers don't bump mtime.
- Prune blocks the miss path — could be fire-and-forget.
No correctness blockers in the diff itself.
| import * as path from "node:path"; | ||
| import { getDocumentConversionCacheDir, isEnoent, logger } from "@oh-my-pi/pi-utils"; | ||
|
|
||
| export const MARKIT_CONVERSION_CACHE_VERSION = 1; |
There was a problem hiding this comment.
should-fix — MARKIT_CONVERSION_CACHE_VERSION is a hand-bumped constant. The on-disk cache survives package upgrades (it lives in ~/.omp/agent/cache/document-conversions / $XDG_CACHE_HOME/omp/...), but the converters in src/markit/converters/* can evolve their markdown output between releases (heading rules, image placeholder shape, PDF page boundary handling, etc.) without anyone here remembering to bump this. The result: stale, formerly-correct markdown returned forever to upgraded clients.
Two cheap options:
- Bind the version to
package.jsonversion(or the markit subpackage hash) so an upgrade automatically invalidates; - Or at minimum: add a comment naming the bump policy + the exact paths that, when touched, require bumping. The cache key (
v${VERSION}-...) +parseCacheEntryalready make a bump safely self-evicting, so the only missing piece is the discipline trigger.
| logger.debug("document conversion cache write failed", { error: errorMessage(error) }); | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
should-fix — .tmp orphans leak. The write path uses ${key}.${pid}.${Date.now()}.tmp then rename, but on a crash/SIGKILL between writeFile and rename the .tmp file stays forever: pruneMarkitConversionCache only walks files matching name.endsWith(".json") (line 81), so these are never seen by the size accounting or the eviction sweep. Over many process kills (think mid-conversion abort during a long agent loop) the cache dir grows beyond the 256 MiB cap unobservably to prune.
Fix shape: in pruneMarkitConversionCache, also collect .tmp files older than, say, a few minutes and fs.rm them before computing the eviction set. Same handful of lines, keeps the cache contract honest.
| } | ||
|
|
||
| if (totalBytes <= MAX_MARKIT_CONVERSION_CACHE_BYTES) return; | ||
|
|
There was a problem hiding this comment.
nit — Eviction is FIFO by mtimeMs, not LRU. A frequently-hit cached PDF whose write predates a never-re-read recent miss is evicted first. readMarkitConversionCache would need to touch utimes (or, cheaper, only mtime) on a hit to convert this to LRU. Skip if you're treating the cap as a coarse safety valve; worth a one-liner comment either way so the policy is on-record.
| const finalized = finalizeConversion(result.markdown); | ||
| if (finalized.ok && cacheKey) { | ||
| await writeMarkitConversionCache(cacheKey, finalized.content); | ||
| } |
There was a problem hiding this comment.
nit — writeMarkitConversionCache awaits pruneMarkitConversionCache, and the caller here awaits the whole write. So on every cache miss (i.e. the slow path the cache exists to amortise) you pay readdir + N×stat of the cache directory before returning the converted markdown to the agent. Prune is bounded and small in practice, but conceptually it could be fire-and-forget — the markdown is already on disk under its final name, the prune is just GC.
|
addressed in 13aee15:
On the Copilot "type error" note for Verification: 24 affected coding-agent tests + 2 utils dir tests pass; |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13aee154ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| return { content: "", ok: false, error: normalizeError(error) }; | ||
| } | ||
| return runCachedBufferConversion(buffer, streamInfo, signal, true); |
There was a problem hiding this comment.
Isolate side-effecting conversions from cache hits
When the same stable buffer is converted with the default agent dir, this persistent cache path makes later runs return before Markit.convert executes. That makes existing side-effect coverage such as test/utils/markit-mupdf-warnings.test.ts order-/rerun-dependent: after the first successful run caches warningPdf(), the next run skips MuPDF and the expected stderr logger call never happens. Please bypass/isolate the cache for side-effect assertions or provide an uncached conversion path.
Useful? React with 👍 / 👎.
Repeated reads of unchanged PDFs, Office documents, and EPUBs re-ran the full markit conversion every time. Add a transparent, content-addressed cache for successful conversions keyed by SHA-256(content) + normalized extension, so repeat reads reuse converted markdown instead of reconverting. - packages/utils: XDG-aware getDocumentConversionCacheDir() helper - coding-agent: markit-cache module (bounded 256 MiB, oldest-first prune, best-effort writes that never fail conversion) layered over the central convertFileWithMarkit/convertBufferWithMarkit wrappers - imageDir conversions stay uncached (cache:"skipped") to preserve PDF image extraction side effects; failed/empty/aborted conversions are never cached - abort-safe: file byte reads run under untilAborted; cache I/O rechecks the signal
- fold coding-agent package version into the cache key so releases that change markit converter output auto-invalidate stale entries - sweep orphaned `.tmp` files during prune (crash between write and rename previously leaked, invisible to the size cap) - make prune fire-and-forget after rename so a cache miss returns once the entry is on disk instead of waiting on a readdir + N×stat sweep - document the FIFO-by-mtime eviction policy on-record - use Bun.file()/Bun.write() for payload I/O per repo conventions
730e7e6 to
e58096a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e58096a1e9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| afterEach(async () => { | ||
| vi.restoreAllMocks(); | ||
| setAgentDir(originalAgentDir); |
There was a problem hiding this comment.
Restore PI_CODING_AGENT_DIR after cache tests
When PI_CODING_AGENT_DIR is initially unset, this cleanup leaves it set to the default path because setAgentDir() always writes the env var. Since these tests set temporary agent dirs, later test files in the same Bun process can run with a persistent agent-dir override, violating the repo’s full-suite-safe test guidance and changing profile/XDG resolution. Please capture the original env value and delete or restore it along with resetting the resolver.
Useful? React with 👍 / 👎.
| } else { | ||
| process.env.XDG_CACHE_HOME = originalXdgCacheHome; | ||
| } | ||
| setAgentDir(originalAgentDir); |
There was a problem hiding this comment.
Restore PI_CODING_AGENT_DIR after XDG dir tests
This cleanup restores the resolved agent path but not whether PI_CODING_AGENT_DIR existed before the test; if it was unset, setAgentDir(originalAgentDir) creates a new long-lived override. That can make subsequent utils tests observe explicit-agent-dir behavior instead of the default/XDG path behavior they are trying to cover, so please restore/delete the original env value and refresh the resolver.
Useful? React with 👍 / 👎.
- Add random UUID suffix to cache temp filenames to avoid same-pid/same-ms collisions - Export pruneMarkitConversionCache and cover orphaned .tmp sweeping with a regression test
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef97b63313
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| expect(convert).toHaveBeenCalledTimes(1); | ||
| } finally { | ||
| setAgentDir(originalAgentDir); |
There was a problem hiding this comment.
Restore the agent-dir environment after this cache test
When PI_CODING_AGENT_DIR is initially unset or a profile env is active, restoring with setAgentDir(originalAgentDir) does not put the process back the way it was: setAgentDir writes PI_CODING_AGENT_DIR and clears profile variables. This new test can therefore leak an explicit default agent-dir override into later files in the same Bun process, making full-suite results depend on test order; capture and restore/delete the relevant env values (then refresh/reset the resolver) instead of only restoring the resolved path.
Useful? React with 👍 / 👎.
What
Add a transparent, bounded, content-addressed cache for successful markit document conversions, plus an XDG-aware cache directory helper in
@oh-my-pi/pi-utils. Repeated reads of unchanged PDFs, Office documents, and EPUBs now reuse converted markdown instead of rerunning the full markit conversion.The cache sits behind the central
convertFileWithMarkit/convertBufferWithMarkitwrappers, so every caller (readtool, CLI@file, fetched document buffers) benefits without callsite changes. Keys areSHA-256(content) + normalized extension, so identical bytes hit regardless of path or session, and changed bytes miss automatically.Why
read report.pdf, thenread report.pdf:50-120, thenread report.pdf:200-260each reconverted the same document. Conversion is deterministic and can cost real time (hundreds of ms for PDFs). Caching turns repeat work into a hash + small JSON read.Design notes:
imageDirconversions are never cached — this preserves existing failure recovery and the read tool's PDF image-extraction side effects (a markdown-only hit would leaveimageDirmissing members).untilAborted; cache I/O rechecks the signal before returning a hit.Testing
All pass. New tests (written test-first) cover: cache dir resolution (XDG + custom agent dir), buffer/file hit-miss-skip, content-hash invalidation, no-cache-on-failure, imageDir skip, and cached markdown reuse across full + selector PDF reads.
Local benchmark (generated fixtures, machine-relative)
convertFileWithMarkit(pdf)convertBufferWithMarkit(docx)ReadTool(pdf full -> selector)After the change, first calls report
cache=missand repeats become internal cache hits. The benchmark script was local-only and is not included in the diff.bun checkpasses