Skip to content

feat(coding-agent): cache successful document conversions#3264

Open
wolfiesch wants to merge 4 commits into
can1357:mainfrom
wolfiesch:wolfiesch/document-conversion-cache
Open

feat(coding-agent): cache successful document conversions#3264
wolfiesch wants to merge 4 commits into
can1357:mainfrom
wolfiesch:wolfiesch/document-conversion-cache

Conversation

@wolfiesch

Copy link
Copy Markdown
Contributor

What

Add a transparent, bounded, content-addressed cache for successful markit document conversions, plus an XDG-aware cache directory helper in @oh-my-pi/pi-utils. Repeated reads of unchanged PDFs, Office documents, and EPUBs now reuse converted markdown instead of rerunning the full markit conversion.

The cache sits behind the central convertFileWithMarkit / convertBufferWithMarkit wrappers, so every caller (read tool, CLI @file, fetched document buffers) benefits without callsite changes. Keys are SHA-256(content) + normalized extension, so identical bytes hit regardless of path or session, and changed bytes miss automatically.

Why

read report.pdf, then read report.pdf:50-120, then read report.pdf:200-260 each reconverted the same document. Conversion is deterministic and can cost real time (hundreds of ms for PDFs). Caching turns repeat work into a hash + small JSON read.

Design notes:

  • Successful conversions only. Failed, empty, aborted, and imageDir conversions are never cached — this preserves existing failure recovery and the read tool's PDF image-extraction side effects (a markdown-only hit would leave imageDir missing members).
  • Bounded. 256 MiB cap with oldest-first pruning after writes. Cache read/write/prune failures degrade silently and never fail a conversion.
  • Abort-safe. File byte reads run under untilAborted; cache I/O rechecks the signal before returning a hit.

Testing

bun --cwd=packages/utils test test/dirs-cache.test.ts
bun --cwd=packages/coding-agent test test/utils/markit-cache.test.ts test/tools/read-pdf-line-range.test.ts test/tools/read-pdf-images.test.ts test/markit-converters.test.ts test/issue-1401-repro.test.ts
bun --cwd=packages/utils run check
bun --cwd=packages/coding-agent run check
bun run check:ts

All pass. New tests (written test-first) cover: cache dir resolution (XDG + custom agent dir), buffer/file hit-miss-skip, content-hash invalidation, no-cache-on-failure, imageDir skip, and cached markdown reuse across full + selector PDF reads.

Local benchmark (generated fixtures, machine-relative)

Scenario Before repeat ms/call After repeat ms/call Reduction
convertFileWithMarkit(pdf) 2.10 0.07 ~97%
convertBufferWithMarkit(docx) 3.32 0.06 ~98%
ReadTool(pdf full -> selector) 1.36 0.27 ~80%

After the change, first calls report cache=miss and repeats become internal cache hits. The benchmark script was local-only and is not included in the diff.


  • bun check passes
  • Tested locally
  • CHANGELOG updated (if user-facing)

Copilot AI review requested due to automatic review settings June 22, 2026 14:51
@github-actions github-actions Bot added the vouched Passed the vouch gate label Jun 22, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a transparent, bounded, content-addressed cache for successful Markit document conversions in the coding-agent, plus a new XDG-aware cache directory helper in @oh-my-pi/pi-utils, so repeated reads of unchanged documents reuse previously converted markdown.

Changes:

  • Introduces a content-hash + normalized-extension cache layer for Markit conversions (buffer + file wrappers), with best-effort bounded pruning.
  • Adds getDocumentConversionCacheDir() to @oh-my-pi/pi-utils/dirs (XDG-aware) and documents it in the utils changelog.
  • Adds/updates tests to validate cache behavior (hit/miss/skip/invalidation) and reuse across full + selector PDF reads.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/utils/test/dirs-cache.test.ts Tests XDG vs custom agent-dir resolution for the new document conversion cache directory helper.
packages/utils/src/dirs.ts Adds getDocumentConversionCacheDir() using the existing agentSubdir(..., "cache") XDG-aware mechanism.
packages/utils/CHANGELOG.md Documents the new dirs helper in utils’ Unreleased section.
packages/coding-agent/test/utils/markit-cache.test.ts Adds unit coverage for Markit conversion caching semantics (hit/miss/skip, invalidation).
packages/coding-agent/test/tools/read-pdf-line-range.test.ts Adds regression coverage ensuring cached markdown is reused across repeated reads of an unchanged PDF.
packages/coding-agent/src/utils/markit.ts Wires caching into convertFileWithMarkit / convertBufferWithMarkit, skipping cache for imageDir conversions.
packages/coding-agent/src/utils/markit-cache.ts Implements the on-disk JSON cache entries, keying, and bounded pruning.
packages/coding-agent/package.json Exposes ./markit exports for external/test imports.
packages/coding-agent/CHANGELOG.md Notes the user-facing behavior change (cached successful document conversions).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +29 to +35
function parseCacheEntry(raw: string): MarkitConversionCacheEntry | null {
const parsed: unknown = JSON.parse(raw);
if (typeof parsed !== "object" || parsed === null) return null;
if (!("version" in parsed) || parsed.version !== MARKIT_CONVERSION_CACHE_VERSION) return null;
if (!("content" in parsed) || typeof parsed.content !== "string") return null;
return { version: MARKIT_CONVERSION_CACHE_VERSION, content: parsed.content };
}
Comment on lines +80 to +84
for (const name of names) {
if (!name.endsWith(".json")) continue;
const entryPath = path.join(cacheDir, name);
try {
const stat = await fs.stat(entryPath);
@roboomp roboomp added feat review:p2 tool Tool behavior and integrations triaged labels Jun 22, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 445153f3bc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


export async function readMarkitConversionCache(
key: string,
): Promise<{ status: "hit"; content: string } | { status: "miss" }> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use Bun APIs for cache file reads

The repo’s /workspace/oh-my-pi/AGENTS.md says file reads/writes should use Bun.file()/Bun.write() and reserves node:fs/promises for directory operations. These cache entry reads (and the matching writes below) are plain file I/O, so this new utility violates that documented convention; please switch the cache payload I/O to Bun APIs while keeping fs for mkdir/rm/readdir/stat/rename.

Useful? React with 👍 / 👎.

@roboomp roboomp left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wolfiesch — nicely scoped, well-tested addition. Caching is gated to successful, non-imageDir conversions, the key (content SHA-256 + normalised extension) is right, and the abort-safety + tmp-then-rename atomicity are correct. Tests cover hit/miss/skip, content invalidation, and the imageDir bypass.

Ranked P2: this turns on a new ~256 MiB on-disk cache by default for every markit caller (read tool, @file mentions, fetched-doc buffers), so the maintainer call is whether that default-on disk footprint is OK without a settings toggle to disable it.

Headline findings (inline):

  • MARKIT_CONVERSION_CACHE_VERSION is a hand-bumped constant; nothing ties it to the markit converters whose output it caches across package upgrades.
  • pruneMarkitConversionCache only sees .json, so .tmp orphans from a SIGKILL between writeFile and rename accumulate outside the 256 MiB accounting.
  • Eviction is FIFO by mtime, not LRU; readers don't bump mtime.
  • Prune blocks the miss path — could be fire-and-forget.

No correctness blockers in the diff itself.

import * as path from "node:path";
import { getDocumentConversionCacheDir, isEnoent, logger } from "@oh-my-pi/pi-utils";

export const MARKIT_CONVERSION_CACHE_VERSION = 1;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should-fixMARKIT_CONVERSION_CACHE_VERSION is a hand-bumped constant. The on-disk cache survives package upgrades (it lives in ~/.omp/agent/cache/document-conversions / $XDG_CACHE_HOME/omp/...), but the converters in src/markit/converters/* can evolve their markdown output between releases (heading rules, image placeholder shape, PDF page boundary handling, etc.) without anyone here remembering to bump this. The result: stale, formerly-correct markdown returned forever to upgraded clients.

Two cheap options:

  • Bind the version to package.json version (or the markit subpackage hash) so an upgrade automatically invalidates;
  • Or at minimum: add a comment naming the bump policy + the exact paths that, when touched, require bumping. The cache key (v${VERSION}-...) + parseCacheEntry already make a bump safely self-evicting, so the only missing piece is the discipline trigger.

logger.debug("document conversion cache write failed", { error: errorMessage(error) });
return;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should-fix.tmp orphans leak. The write path uses ${key}.${pid}.${Date.now()}.tmp then rename, but on a crash/SIGKILL between writeFile and rename the .tmp file stays forever: pruneMarkitConversionCache only walks files matching name.endsWith(".json") (line 81), so these are never seen by the size accounting or the eviction sweep. Over many process kills (think mid-conversion abort during a long agent loop) the cache dir grows beyond the 256 MiB cap unobservably to prune.

Fix shape: in pruneMarkitConversionCache, also collect .tmp files older than, say, a few minutes and fs.rm them before computing the eviction set. Same handful of lines, keeps the cache contract honest.

}

if (totalBytes <= MAX_MARKIT_CONVERSION_CACHE_BYTES) return;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit — Eviction is FIFO by mtimeMs, not LRU. A frequently-hit cached PDF whose write predates a never-re-read recent miss is evicted first. readMarkitConversionCache would need to touch utimes (or, cheaper, only mtime) on a hit to convert this to LRU. Skip if you're treating the cap as a coarse safety valve; worth a one-liner comment either way so the policy is on-record.

const finalized = finalizeConversion(result.markdown);
if (finalized.ok && cacheKey) {
await writeMarkitConversionCache(cacheKey, finalized.content);
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitwriteMarkitConversionCache awaits pruneMarkitConversionCache, and the caller here awaits the whole write. So on every cache miss (i.e. the slow path the cache exists to amortise) you pay readdir + N×stat of the cache directory before returning the converted markdown to the agent. Prune is bounded and small in practice, but conceptually it could be fire-and-forget — the markdown is already on disk under its final name, the prune is just GC.

@wolfiesch

wolfiesch commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

addressed in 13aee15:

  • Cache invalidation on converter changes (Copilot/Codex): the coding-agent package.json version is now folded into the cache key (v<schema>-<pkgVersion>-<ext>-<sha256>), so any release shipping new markit output auto-invalidates stale entries. The schema constant stays separate for on-disk format bumps.
  • Prune blocks the miss path (roboomp): writeMarkitConversionCache no longer awaits the prune — it's fire-and-forget after the rename (with .catch + debug logging), so a miss returns once the entry is on disk rather than waiting on a readdir + N×stat sweep.
  • Orphaned .tmp files (roboomp): prune now sweeps stale .tmp files (>5 min old) left by a crash/SIGKILL between write and rename. Previously they never became .json entries and were invisible to the size cap.
  • FIFO vs LRU (roboomp): kept FIFO-by-mtime (reads don't bump mtime) since the cap is a coarse disk-footprint safety valve, and documented the policy in a comment on-record.
  • Bun I/O conventions: payload read/write now use Bun.file()/Bun.write() per AGENTS.md; directory ops stay on node:fs/promises.

On the Copilot "type error" note for parsed.version: the code uses "version" in parsed narrowing rather than an inline cast, which type-checks clean and matches the repo's no-inline-cast rule, so I left it as-is.

Verification: 24 affected coding-agent tests + 2 utils dir tests pass; bun --cwd=packages/coding-agent run check, bun --cwd=packages/utils run check, and bun run check:ts all pass.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13aee154ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
return { content: "", ok: false, error: normalizeError(error) };
}
return runCachedBufferConversion(buffer, streamInfo, signal, true);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate side-effecting conversions from cache hits

When the same stable buffer is converted with the default agent dir, this persistent cache path makes later runs return before Markit.convert executes. That makes existing side-effect coverage such as test/utils/markit-mupdf-warnings.test.ts order-/rerun-dependent: after the first successful run caches warningPdf(), the next run skips MuPDF and the expected stderr logger call never happens. Please bypass/isolate the cache for side-effect assertions or provide an uncached conversion path.

Useful? React with 👍 / 👎.

Repeated reads of unchanged PDFs, Office documents, and EPUBs re-ran the
full markit conversion every time. Add a transparent, content-addressed
cache for successful conversions keyed by SHA-256(content) + normalized
extension, so repeat reads reuse converted markdown instead of
reconverting.

- packages/utils: XDG-aware getDocumentConversionCacheDir() helper
- coding-agent: markit-cache module (bounded 256 MiB, oldest-first prune,
  best-effort writes that never fail conversion) layered over the central
  convertFileWithMarkit/convertBufferWithMarkit wrappers
- imageDir conversions stay uncached (cache:"skipped") to preserve PDF
  image extraction side effects; failed/empty/aborted conversions are
  never cached
- abort-safe: file byte reads run under untilAborted; cache I/O rechecks
  the signal
- fold coding-agent package version into the cache key so releases that
  change markit converter output auto-invalidate stale entries
- sweep orphaned `.tmp` files during prune (crash between write and
  rename previously leaked, invisible to the size cap)
- make prune fire-and-forget after rename so a cache miss returns once
  the entry is on disk instead of waiting on a readdir + N×stat sweep
- document the FIFO-by-mtime eviction policy on-record
- use Bun.file()/Bun.write() for payload I/O per repo conventions
@wolfiesch wolfiesch force-pushed the wolfiesch/document-conversion-cache branch from 730e7e6 to e58096a Compare June 22, 2026 23:54

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e58096a1e9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


afterEach(async () => {
vi.restoreAllMocks();
setAgentDir(originalAgentDir);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore PI_CODING_AGENT_DIR after cache tests

When PI_CODING_AGENT_DIR is initially unset, this cleanup leaves it set to the default path because setAgentDir() always writes the env var. Since these tests set temporary agent dirs, later test files in the same Bun process can run with a persistent agent-dir override, violating the repo’s full-suite-safe test guidance and changing profile/XDG resolution. Please capture the original env value and delete or restore it along with resetting the resolver.

Useful? React with 👍 / 👎.

} else {
process.env.XDG_CACHE_HOME = originalXdgCacheHome;
}
setAgentDir(originalAgentDir);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore PI_CODING_AGENT_DIR after XDG dir tests

This cleanup restores the resolved agent path but not whether PI_CODING_AGENT_DIR existed before the test; if it was unset, setAgentDir(originalAgentDir) creates a new long-lived override. That can make subsequent utils tests observe explicit-agent-dir behavior instead of the default/XDG path behavior they are trying to cover, so please restore/delete the original env value and refresh the resolver.

Useful? React with 👍 / 👎.

- Add random UUID suffix to cache temp filenames to avoid same-pid/same-ms collisions
- Export pruneMarkitConversionCache and cover orphaned .tmp sweeping with a regression test

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef97b63313

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


expect(convert).toHaveBeenCalledTimes(1);
} finally {
setAgentDir(originalAgentDir);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restore the agent-dir environment after this cache test

When PI_CODING_AGENT_DIR is initially unset or a profile env is active, restoring with setAgentDir(originalAgentDir) does not put the process back the way it was: setAgentDir writes PI_CODING_AGENT_DIR and clears profile variables. This new test can therefore leak an explicit default agent-dir override into later files in the same Bun process, making full-suite results depend on test order; capture and restore/delete the relevant env values (then refresh/reset the resolver) instead of only restoring the resolved path.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feat review:p2 tool Tool behavior and integrations triaged vouched Passed the vouch gate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants