feat(coding-agent): cache successful document conversions by wolfiesch · Pull Request #3264 · can1357/oh-my-pi

wolfiesch · 2026-06-22T14:51:09Z

What

Add a transparent, bounded, content-addressed cache for successful markit document conversions, plus an XDG-aware cache directory helper in @oh-my-pi/pi-utils. Repeated reads of unchanged PDFs, Office documents, and EPUBs now reuse converted markdown instead of rerunning the full markit conversion.

The cache sits behind the central convertFileWithMarkit / convertBufferWithMarkit wrappers, so every caller (read tool, CLI @file, fetched document buffers) benefits without callsite changes. Keys are SHA-256(content) + normalized extension, so identical bytes hit regardless of path or session, and changed bytes miss automatically.
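As a rough illustration of the keying idea described above, here is a minimal sketch. The helper names (`normalizeExtension`, `conversionCacheKey`) and the exact key layout are assumptions made for this example, not the PR's actual implementation.

```typescript
import { createHash } from "node:crypto";

/** Lowercase the extension and strip leading dots so ".PDF" and "pdf" agree. */
function normalizeExtension(ext: string): string {
	return ext.trim().toLowerCase().replace(/^\.+/, "");
}

/** Hypothetical key builder: "<ext>-<sha256 hex of the bytes>". */
function conversionCacheKey(content: Uint8Array, ext: string): string {
	const digest = createHash("sha256").update(content).digest("hex");
	return `${normalizeExtension(ext)}-${digest}`;
}

const encoder = new TextEncoder();
const original = encoder.encode("%PDF-1.7 example bytes");
const edited = encoder.encode("%PDF-1.7 edited bytes");

// Same bytes hit regardless of how the extension was spelled.
const keyA = conversionCacheKey(original, ".PDF");
const keyB = conversionCacheKey(original, "pdf");
// Changed bytes produce a different key, so they miss automatically.
const keyC = conversionCacheKey(edited, "pdf");
```

Because the path never enters the key, two copies of the same file in different directories would share one entry under this scheme.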

Why

`read report.pdf`, then `read report.pdf:50-120`, then `read report.pdf:200-260` each reconverted the same document. Conversion is deterministic and can cost real time (hundreds of ms for PDFs). Caching turns repeat work into a hash + small JSON read.

Design notes:

Successful conversions only. Failed, empty, aborted, and imageDir conversions are never cached — this preserves existing failure recovery and the read tool's PDF image-extraction side effects (a markdown-only hit would leave imageDir missing members).
Bounded. 256 MiB cap with oldest-first pruning after writes. Cache read/write/prune failures degrade silently and never fail a conversion.
Abort-safe. File byte reads run under untilAborted; cache I/O rechecks the signal before returning a hit.
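The "successful conversions only" rule in the design notes can be pictured as a single gate in front of the cache write. The sketch below is hypothetical: the type and function names are invented for illustration, and treating "empty" as zero-length content is an assumption.

```typescript
interface ConversionOutcome {
	ok: boolean;
	content: string;
}

interface CacheGateOptions {
	imageDir?: string;
	signal?: AbortSignal;
}

/** Hypothetical gate: returns true only when a result is safe to cache. */
function shouldCacheConversion(outcome: ConversionOutcome, options: CacheGateOptions = {}): boolean {
	// imageDir conversions write image files as a side effect; a markdown-only hit would skip them.
	if (options.imageDir !== undefined) return false;
	// An aborted conversion may be partial.
	if (options.signal?.aborted) return false;
	// Failures are never cached, so existing failure recovery keeps working.
	if (!outcome.ok) return false;
	// Assumption: "empty" means zero-length content.
	return outcome.content.length > 0;
}
```

Keeping the decision in one pure function would make each skip reason easy to cover with a unit test.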

Testing

bun --cwd=packages/utils test test/dirs-cache.test.ts
bun --cwd=packages/coding-agent test test/utils/markit-cache.test.ts test/tools/read-pdf-line-range.test.ts test/tools/read-pdf-images.test.ts test/markit-converters.test.ts test/issue-1401-repro.test.ts
bun --cwd=packages/utils run check
bun --cwd=packages/coding-agent run check
bun run check:ts

All pass. New tests (written test-first) cover: cache dir resolution (XDG + custom agent dir), buffer/file hit-miss-skip, content-hash invalidation, no-cache-on-failure, imageDir skip, and cached markdown reuse across full + selector PDF reads.

Local benchmark (generated fixtures, machine-relative)

| Scenario | Before (repeat, ms/call) | After (repeat, ms/call) | Reduction |
| --- | --- | --- | --- |
| `convertFileWithMarkit(pdf)` | 2.10 | 0.07 | ~97% |
| `convertBufferWithMarkit(docx)` | 3.32 | 0.06 | ~98% |
| `ReadTool(pdf full -> selector)` | 1.36 | 0.27 | ~80% |

After the change, first calls report cache=miss and repeats become internal cache hits. The benchmark script was local-only and is not included in the diff.

bun check passes
Tested locally
CHANGELOG updated (if user-facing)

Copilot

Pull request overview

Adds a transparent, bounded, content-addressed cache for successful Markit document conversions in the coding-agent, plus a new XDG-aware cache directory helper in @oh-my-pi/pi-utils, so repeated reads of unchanged documents reuse previously converted markdown.

Changes:

Introduces a content-hash + normalized-extension cache layer for Markit conversions (buffer + file wrappers), with best-effort bounded pruning.
Adds getDocumentConversionCacheDir() to @oh-my-pi/pi-utils/dirs (XDG-aware) and documents it in the utils changelog.
Adds/updates tests to validate cache behavior (hit/miss/skip/invalidation) and reuse across full + selector PDF reads.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| packages/utils/test/dirs-cache.test.ts | Tests XDG vs custom agent-dir resolution for the new document conversion cache directory helper. |
| packages/utils/src/dirs.ts | Adds `getDocumentConversionCacheDir()` using the existing `agentSubdir(..., "cache")` XDG-aware mechanism. |
| packages/utils/CHANGELOG.md | Documents the new dirs helper in utils’ Unreleased section. |
| packages/coding-agent/test/utils/markit-cache.test.ts | Adds unit coverage for Markit conversion caching semantics (hit/miss/skip, invalidation). |
| packages/coding-agent/test/tools/read-pdf-line-range.test.ts | Adds regression coverage ensuring cached markdown is reused across repeated reads of an unchanged PDF. |
| packages/coding-agent/src/utils/markit.ts | Wires caching into `convertFileWithMarkit` / `convertBufferWithMarkit`, skipping cache for `imageDir` conversions. |
| packages/coding-agent/src/utils/markit-cache.ts | Implements the on-disk JSON cache entries, keying, and bounded pruning. |
| packages/coding-agent/package.json | Exposes `./markit` exports for external/test imports. |
| packages/coding-agent/CHANGELOG.md | Notes the user-facing behavior change (cached successful document conversions). |


+function parseCacheEntry(raw: string): MarkitConversionCacheEntry | null {
+	const parsed: unknown = JSON.parse(raw);
+	if (typeof parsed !== "object" || parsed === null) return null;
+	if (!("version" in parsed) || parsed.version !== MARKIT_CONVERSION_CACHE_VERSION) return null;
+	if (!("content" in parsed) || typeof parsed.content !== "string") return null;
+	return { version: MARKIT_CONVERSION_CACHE_VERSION, content: parsed.content };
+}


+	for (const name of names) {
+		if (!name.endsWith(".json")) continue;
+		const entryPath = path.join(cacheDir, name);
+		try {
+			const stat = await fs.stat(entryPath);


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 445153f3bc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-22T14:57:02Z

+
+export async function readMarkitConversionCache(
+	key: string,
+): Promise<{ status: "hit"; content: string } | { status: "miss" }> {


Use Bun APIs for cache file reads

The repo’s /workspace/oh-my-pi/AGENTS.md says file reads/writes should use Bun.file()/Bun.write() and reserves node:fs/promises for directory operations. These cache entry reads (and the matching writes below) are plain file I/O, so this new utility violates that documented convention; please switch the cache payload I/O to Bun APIs while keeping fs for mkdir/rm/readdir/stat/rename.

Useful? React with 👍 / 👎.

roboomp

Thanks @wolfiesch — nicely scoped, well-tested addition. Caching is gated to successful, non-imageDir conversions, the key (content SHA-256 + normalised extension) is right, and the abort-safety + tmp-then-rename atomicity are correct. Tests cover hit/miss/skip, content invalidation, and the imageDir bypass.

Ranked P2: this turns on a new ~256 MiB on-disk cache by default for every markit caller (read tool, @file mentions, fetched-doc buffers), so the maintainer call is whether that default-on disk footprint is OK without a settings toggle to disable it.

Headline findings (inline):

MARKIT_CONVERSION_CACHE_VERSION is a hand-bumped constant; nothing ties it to the markit converters whose output it caches across package upgrades.
pruneMarkitConversionCache only sees .json, so .tmp orphans from a SIGKILL between writeFile and rename accumulate outside the 256 MiB accounting.
Eviction is FIFO by mtime, not LRU; readers don't bump mtime.
Prune blocks the miss path — could be fire-and-forget.

No correctness blockers in the diff itself.

roboomp · 2026-06-22T14:57:20Z

+import * as path from "node:path";
+import { getDocumentConversionCacheDir, isEnoent, logger } from "@oh-my-pi/pi-utils";
+
+export const MARKIT_CONVERSION_CACHE_VERSION = 1;


should-fix — MARKIT_CONVERSION_CACHE_VERSION is a hand-bumped constant. The on-disk cache survives package upgrades (it lives in ~/.omp/agent/cache/document-conversions / $XDG_CACHE_HOME/omp/...), but the converters in src/markit/converters/* can evolve their markdown output between releases (heading rules, image placeholder shape, PDF page boundary handling, etc.) without anyone here remembering to bump this. The result: stale, formerly-correct markdown returned forever to upgraded clients.

Two cheap options:

Bind the version to package.json version (or the markit subpackage hash) so an upgrade automatically invalidates;

Or at minimum: add a comment naming the bump policy + the exact paths that, when touched, require bumping. The cache key (v${VERSION}-...) + parseCacheEntry already make a bump safely self-evicting, so the only missing piece is the discipline trigger.
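The first option could be sketched roughly as follows. The function name, the constant, and the exact key layout are assumptions for illustration; the only detail taken from the comment above is that the key starts with a `v${VERSION}-` style prefix.

```typescript
// Hypothetical on-disk format version, bumped only when the entry shape changes.
const SCHEMA_VERSION = 1;

/**
 * Hypothetical versioned key: folding the package version into the prefix
 * means an upgrade looks up different keys and old entries stop matching.
 */
function versionedCacheKey(pkgVersion: string, ext: string, sha256Hex: string): string {
	return `v${SCHEMA_VERSION}-${pkgVersion}-${ext}-${sha256Hex}`;
}

const digest = "ab".repeat(32); // placeholder for a real SHA-256 hex digest
const beforeUpgrade = versionedCacheKey("1.0.0", "pdf", digest);
const afterUpgrade = versionedCacheKey("1.0.1", "pdf", digest);
```

Entries written under the old prefix are never read again and would age out through the normal size-based pruning.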

roboomp · 2026-06-22T14:57:20Z

+		logger.debug("document conversion cache write failed", { error: errorMessage(error) });
+		return;
+	}
+


should-fix — .tmp orphans leak. The write path uses ${key}.${pid}.${Date.now()}.tmp then rename, but on a crash/SIGKILL between writeFile and rename the .tmp file stays forever: pruneMarkitConversionCache only walks files matching name.endsWith(".json") (line 81), so these are never seen by the size accounting or the eviction sweep. Over many process kills (think mid-conversion abort during a long agent loop) the cache dir grows beyond the 256 MiB cap unobservably to prune.

Fix shape: in pruneMarkitConversionCache, also collect .tmp files older than, say, a few minutes and fs.rm them before computing the eviction set. Same handful of lines, keeps the cache contract honest.
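One way to picture the suggested sweep is as a pure selection step over a directory listing. This is a sketch under assumptions: the names are invented here, and the five-minute threshold is only one reading of "a few minutes".

```typescript
interface DirEntryInfo {
	name: string;
	mtimeMs: number;
}

// Assumption: five minutes stands in for "a few minutes".
const STALE_TMP_AGE_MS = 5 * 60 * 1000;

/** Hypothetical helper: pick .tmp files old enough that no live writer can own them. */
function selectStaleTmpFiles(entries: DirEntryInfo[], nowMs: number, maxAgeMs: number = STALE_TMP_AGE_MS): string[] {
	return entries
		.filter(entry => entry.name.endsWith(".tmp") && nowMs - entry.mtimeMs > maxAgeMs)
		.map(entry => entry.name);
}

const now = 10 * 60 * 1000;
const listing: DirEntryInfo[] = [
	{ name: "a.json", mtimeMs: 0 }, // real entry, left to size-based eviction
	{ name: "b.123.1.tmp", mtimeMs: 0 }, // ten minutes old: orphan
	{ name: "c.123.2.tmp", mtimeMs: now - 1000 }, // one second old: may still be in flight
];
const stale = selectStaleTmpFiles(listing, now);
```

The age threshold matters because a fresh `.tmp` file may belong to a write that is about to be renamed.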

roboomp · 2026-06-22T14:57:21Z

+	}
+
+	if (totalBytes <= MAX_MARKIT_CONVERSION_CACHE_BYTES) return;
+


nit — Eviction is FIFO by mtimeMs, not LRU. A frequently-hit cached PDF whose write predates a never-re-read recent miss is evicted first. readMarkitConversionCache would need to touch utimes (or, cheaper, only mtime) on a hit to convert this to LRU. Skip if you're treating the cap as a coarse safety valve; worth a one-liner comment either way so the policy is on-record.
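To make the FIFO versus LRU distinction concrete, here is a small sketch of oldest-first selection by mtime. The names are hypothetical and the byte cap is shrunk for readability; it illustrates the policy, not the PR's actual prune code.

```typescript
interface CacheEntryInfo {
	name: string;
	size: number;
	mtimeMs: number;
}

/** Hypothetical selector: evict oldest-mtime entries until the total fits under the cap. */
function selectEvictions(entries: CacheEntryInfo[], maxBytes: number): string[] {
	let total = entries.reduce((sum, entry) => sum + entry.size, 0);
	const evicted: string[] = [];
	const oldestFirst = [...entries].sort((a, b) => a.mtimeMs - b.mtimeMs);
	for (const entry of oldestFirst) {
		if (total <= maxBytes) break;
		evicted.push(entry.name);
		total -= entry.size;
	}
	return evicted;
}

const entries: CacheEntryInfo[] = [
	{ name: "hot.json", size: 60, mtimeMs: 1 }, // written first, read often
	{ name: "warm.json", size: 60, mtimeMs: 2 },
	{ name: "cold.json", size: 60, mtimeMs: 3 }, // written last, never re-read
];

// FIFO: reads do not change mtime, so the frequently read entry goes first.
const fifo = selectEvictions(entries, 100);

// LRU-like: if a hit bumped mtime, the hot entry would survive instead.
const touched = entries.map(entry => (entry.name === "hot.json" ? { ...entry, mtimeMs: 4 } : entry));
const lru = selectEvictions(touched, 100);
```

The selector is identical in both runs; only whether a hit updates mtime decides which policy results.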

roboomp · 2026-06-22T14:57:21Z

+	const finalized = finalizeConversion(result.markdown);
+	if (finalized.ok && cacheKey) {
+		await writeMarkitConversionCache(cacheKey, finalized.content);
+	}


nit — writeMarkitConversionCache awaits pruneMarkitConversionCache, and the caller here awaits the whole write. So on every cache miss (i.e. the slow path the cache exists to amortise) you pay readdir + N×stat of the cache directory before returning the converted markdown to the agent. Prune is bounded and small in practice, but conceptually it could be fire-and-forget — the markdown is already on disk under its final name, the prune is just GC.
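The fire-and-forget shape could look something like the sketch below. `fireAndForget` and the stand-in `prune` are hypothetical names, and the `events` array exists only to show ordering.

```typescript
/** Hypothetical helper: start a task without awaiting it, routing failures to a handler. */
function fireAndForget(task: () => Promise<void>, onError: (error: unknown) => void): void {
	void task().catch(onError);
}

const events: string[] = [];

// Stand-in for the prune: the await represents the readdir + stat sweep.
async function prune(): Promise<void> {
	events.push("prune:start");
	await Promise.resolve();
	events.push("prune:done");
}

// The caller continues immediately; the sweep finishes on a later tick.
fireAndForget(prune, error => events.push(`prune:failed ${String(error)}`));
events.push("write:returned");
```

Attaching `.catch` is what keeps a failed sweep from surfacing as an unhandled rejection, which fits the stated goal that cache failures never fail a conversion.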

wolfiesch · 2026-06-22T21:31:49Z

addressed in 13aee15:

Cache invalidation on converter changes (Copilot/Codex): the coding-agent package.json version is now folded into the cache key (v<schema>-<pkgVersion>-<ext>-<sha256>), so any release shipping new markit output auto-invalidates stale entries. The schema constant stays separate for on-disk format bumps.
Prune blocks the miss path (roboomp): writeMarkitConversionCache no longer awaits the prune — it's fire-and-forget after the rename (with .catch + debug logging), so a miss returns once the entry is on disk rather than waiting on a readdir + N×stat sweep.
Orphaned .tmp files (roboomp): prune now sweeps stale .tmp files (>5 min old) left by a crash/SIGKILL between write and rename. Previously they never became .json entries and were invisible to the size cap.
FIFO vs LRU (roboomp): kept FIFO-by-mtime (reads don't bump mtime) since the cap is a coarse disk-footprint safety valve, and documented the policy in a comment on-record.
Bun I/O conventions: payload read/write now use Bun.file()/Bun.write() per AGENTS.md; directory ops stay on node:fs/promises.

On the Copilot "type error" note for parsed.version: the code uses "version" in parsed narrowing rather than an inline cast, which type-checks clean and matches the repo's no-inline-cast rule, so I left it as-is.

Verification: 24 affected coding-agent tests + 2 utils dir tests pass; bun --cwd=packages/coding-agent run check, bun --cwd=packages/utils run check, and bun run check:ts all pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13aee154ee


chatgpt-codex-connector · 2026-06-22T21:37:10Z

-		}
-		return { content: "", ok: false, error: normalizeError(error) };
-	}
+	return runCachedBufferConversion(buffer, streamInfo, signal, true);


Isolate side-effecting conversions from cache hits

When the same stable buffer is converted with the default agent dir, this persistent cache path makes later runs return before Markit.convert executes. That makes existing side-effect coverage such as test/utils/markit-mupdf-warnings.test.ts order-/rerun-dependent: after the first successful run caches warningPdf(), the next run skips MuPDF and the expected stderr logger call never happens. Please bypass/isolate the cache for side-effect assertions or provide an uncached conversion path.
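One possible shape for an uncached path is an opt-out flag on the wrapper. The sketch below uses an in-memory `Map` in place of the on-disk cache, and every name in it is hypothetical.

```typescript
type Converter = (input: string) => string;

/** Hypothetical wrapper: caches by input unless the caller opts out. */
function createCachedConverter(convert: Converter) {
	const cache = new Map<string, string>();
	return (input: string, options: { useCache?: boolean } = {}): string => {
		const useCache = options.useCache ?? true;
		if (useCache) {
			const hit = cache.get(input);
			if (hit !== undefined) return hit;
		}
		// The real converter runs here, so its side effects (logging, warnings) happen.
		const output = convert(input);
		if (useCache) cache.set(input, output);
		return output;
	};
}

let converterCalls = 0;
const convertDocument = createCachedConverter(input => {
	converterCalls += 1;
	return input.toUpperCase();
});

convertDocument("warning pdf"); // miss: converter runs
convertDocument("warning pdf"); // hit: converter skipped
convertDocument("warning pdf", { useCache: false }); // uncached: converter runs again
```

A test that asserts on converter side effects would call the uncached form so that reruns do not depend on what an earlier run left in the cache.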


Repeated reads of unchanged PDFs, Office documents, and EPUBs re-ran the full markit conversion every time. Add a transparent, content-addressed cache for successful conversions keyed by SHA-256(content) + normalized extension, so repeat reads reuse converted markdown instead of reconverting.

- packages/utils: XDG-aware getDocumentConversionCacheDir() helper
- coding-agent: markit-cache module (bounded 256 MiB, oldest-first prune, best-effort writes that never fail conversion) layered over the central convertFileWithMarkit/convertBufferWithMarkit wrappers
- imageDir conversions stay uncached (cache:"skipped") to preserve PDF image extraction side effects; failed/empty/aborted conversions are never cached
- abort-safe: file byte reads run under untilAborted; cache I/O rechecks the signal

- fold coding-agent package version into the cache key so releases that change markit converter output auto-invalidate stale entries
- sweep orphaned `.tmp` files during prune (crash between write and rename previously leaked, invisible to the size cap)
- make prune fire-and-forget after rename so a cache miss returns once the entry is on disk instead of waiting on a readdir + N×stat sweep
- document the FIFO-by-mtime eviction policy on-record
- use Bun.file()/Bun.write() for payload I/O per repo conventions


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e58096a1e9


chatgpt-codex-connector · 2026-06-23T00:01:06Z

+
+	afterEach(async () => {
+		vi.restoreAllMocks();
+		setAgentDir(originalAgentDir);


Restore PI_CODING_AGENT_DIR after cache tests

When PI_CODING_AGENT_DIR is initially unset, this cleanup leaves it set to the default path because setAgentDir() always writes the env var. Since these tests set temporary agent dirs, later test files in the same Bun process can run with a persistent agent-dir override, violating the repo’s full-suite-safe test guidance and changing profile/XDG resolution. Please capture the original env value and delete or restore it along with resetting the resolver.
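The capture-and-restore pattern requested here might be sketched as below. `snapshotEnv` is a hypothetical helper, and refreshing the agent-dir resolver is left out because its API is not shown in this thread.

```typescript
/**
 * Hypothetical helper: remember whether each variable was set and return a
 * function that puts the environment back exactly as it was.
 */
function snapshotEnv(names: string[]): () => void {
	const saved = new Map<string, string | undefined>();
	for (const name of names) saved.set(name, process.env[name]);
	return () => {
		saved.forEach((value, name) => {
			// An originally unset variable must be deleted, not set to a default.
			if (value === undefined) delete process.env[name];
			else process.env[name] = value;
		});
	};
}

// Typical test usage: snapshot before overriding, restore in cleanup.
const restoreEnv = snapshotEnv(["PI_CODING_AGENT_DIR", "XDG_CACHE_HOME"]);
process.env.PI_CODING_AGENT_DIR = "/tmp/example-agent-dir";
restoreEnv();
```

The key difference from restoring only the resolved path is that "unset" is preserved as "unset", so later test files still exercise the default resolution.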


chatgpt-codex-connector · 2026-06-23T00:01:06Z

+		} else {
+			process.env.XDG_CACHE_HOME = originalXdgCacheHome;
+		}
+		setAgentDir(originalAgentDir);


Restore PI_CODING_AGENT_DIR after XDG dir tests

This cleanup restores the resolved agent path but not whether PI_CODING_AGENT_DIR existed before the test; if it was unset, setAgentDir(originalAgentDir) creates a new long-lived override. That can make subsequent utils tests observe explicit-agent-dir behavior instead of the default/XDG path behavior they are trying to cover, so please restore/delete the original env value and refresh the resolver.


- Add random UUID suffix to cache temp filenames to avoid same-pid/same-ms collisions
- Export pruneMarkitConversionCache and cover orphaned .tmp sweeping with a regression test
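A temp filename with a random suffix could be built along these lines. `tempEntryName` is a hypothetical name; the `${key}.${pid}.${Date.now()}.tmp` prefix follows the format quoted earlier in the review, and the position of the UUID within the name is an assumption.

```typescript
import { randomUUID } from "node:crypto";

/** Hypothetical builder: pid and timestamp alone can collide, so a UUID is appended. */
function tempEntryName(key: string, pid: number = process.pid, nowMs: number = Date.now()): string {
	return `${key}.${pid}.${nowMs}.${randomUUID()}.tmp`;
}

// Two writes from the same process in the same millisecond still get distinct names.
const first = tempEntryName("example-key", 4242, 1_700_000_000_000);
const second = tempEntryName("example-key", 4242, 1_700_000_000_000);
```

Keeping the `.tmp` extension last means a sweep that matches on `name.endsWith(".tmp")` would still find these files.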

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef97b63313


chatgpt-codex-connector · 2026-06-23T00:49:16Z

+
+			expect(convert).toHaveBeenCalledTimes(1);
+		} finally {
+			setAgentDir(originalAgentDir);


Restore the agent-dir environment after this cache test

When PI_CODING_AGENT_DIR is initially unset or a profile env is active, restoring with setAgentDir(originalAgentDir) does not put the process back the way it was: setAgentDir writes PI_CODING_AGENT_DIR and clears profile variables. This new test can therefore leak an explicit default agent-dir override into later files in the same Bun process, making full-suite results depend on test order; capture and restore/delete the relevant env values (then refresh/reset the resolver) instead of only restoring the resolved path.


Copilot AI review requested due to automatic review settings June 22, 2026 14:51

github-actions Bot added the vouched label (Passed the vouch gate) Jun 22, 2026

Copilot started reviewing on behalf of wolfiesch June 22, 2026 14:51

Copilot AI reviewed Jun 22, 2026


roboomp added the feat, review:p2, tool (Tool behavior and integrations), and triaged labels Jun 22, 2026

chatgpt-codex-connector Bot reviewed Jun 22, 2026


roboomp reviewed Jun 22, 2026


chatgpt-codex-connector Bot reviewed Jun 22, 2026


wolfiesch added 3 commits June 22, 2026 16:54

fix(coding-agent): add uncached buffer-conversion path for side-effect tests

e58096a

wolfiesch force-pushed the wolfiesch/document-conversion-cache branch from 730e7e6 to e58096a Compare June 22, 2026 23:54

chatgpt-codex-connector Bot reviewed Jun 23, 2026


refactor(coding-agent): harden document conversion cache temp handling

ef97b63


chatgpt-codex-connector Bot reviewed Jun 23, 2026




3 participants
