feat: memorybench harness + AI-assistant memory craft #62

Merged: veerps57 merged 1 commit into main from feat/memorybench-and-distillation on May 15, 2026

veerps57 (Owner) commented May 15, 2026

Problem

Two needs motivated this work; running the bench surfaced two more engine gaps, which are fixed in this same PR.

  1. Memento has no published end-to-end benchmark. The MCP-registry launch built credibility on architectural commitments and a clean API; the next step toward trust is "here are the numbers, here is how to reproduce them." We need a reusable harness against the de-facto industry datasets (LoCoMo, LongMemEval) so a sceptical engineer can re-run a baseline on their laptop and verify.
  2. Memento's extract_memory contract had several first-try-wrong footguns for AI assistants doing distillation. While the benchmark provider was being built, every distillation attempt failed silently because the candidate shape, the topic-line requirement, and the async-mode receipt semantics weren't surfaced in the MCP tool description (some were in the skill but invisible to a tool consumer reading only tools/list). The tool's own example was itself non-validating. Tag-regex failures returned a bare (root): Invalid. This is the kind of friction that turns "give Memento a try" into "give Memento up."
  3. FTS5 recall was stem-blind for prose queries. The bench's first low-score question (LongMemEval d24813b1, "tips on what to bake for colleagues") missed the gold-truth memory because the memory said "colleague's going-away party" and the query said "colleagues" — the default unicode61 FTS5 tokenizer treats them as different tokens. The retrieval.fts.tokenizer config key advertised porter as a tunable alternative, but no migration or runtime code ever read it: dead-code tunability. Vector embedding rescued some morphological misses, but not enough, and the failure mode is exactly the one a durable memory layer needs to handle well (the speaker's wording and the future question's wording rarely match in surface form).
  4. embedder-local.embedBatch was sequential under the hood. The implementation looped extractor(text, ...) per row with a comment pointing at the transformers.js v2 limitation. transformers.js v3 (already pinned via ^3.0.0) accepts an array input and runs one forward pass for the whole batch — verified row-by-row numerically identical to the single-call form.
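
The token mismatch in item 3 can be illustrated with a small sketch. It assumes a simplified, ASCII-only stand-in for unicode61 (lowercase, then split on anything that is not a letter or digit) and a deliberately naive plural-stripping step. Neither is the real FTS5 tokenizer or the Porter algorithm, and all function names are hypothetical.

```typescript
// Simplified stand-in for FTS5's unicode61 tokenizer: lowercase the text and
// split on every character that is not an ASCII letter or digit.
// (Assumption: the real tokenizer is Unicode-aware and folds diacritics.)
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// Deliberately naive stand-in for a stemmer: strip one trailing "s".
// This is NOT the Porter algorithm; it only shows why stemming helps.
function naiveStem(token: string): string {
  return token.length > 3 && token.endsWith("s") ? token.slice(0, -1) : token;
}

// True when at least one query token equals a memory token after `normalise`.
function matches(
  memory: string,
  query: string,
  normalise: (token: string) => string,
): boolean {
  const indexed = new Set(tokenize(memory).map(normalise));
  return tokenize(query)
    .map(normalise)
    .some((token) => indexed.has(token));
}

const memory = "colleague's going-away party";
const query = "colleagues";

// Exact tokens: "colleague" and "s" on one side, "colleagues" on the other.
const exactHit = matches(memory, query, (token) => token);
// After stemming, both sides reduce to "colleague".
const stemmedHit = matches(memory, query, naiveStem);

console.log({ exactHit, stemmedHit });
```

The apostrophe acts as a separator, so the stored text never contains the token "colleagues"; only a normalisation step applied to both the index and the query brings the two forms together.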

Change

Four coordinated workstreams ship together. The bench surfaced (3) and (4); the fixes that close them improve Memento for every assistant doing memory work, not just for the bench's score.

  • Bench driver — scripts/bench.mjs + docs/guides/benchmark.md. A vanilla-Node ESM driver that builds Memento, stages a memorybench fork at a pinned ref (or a local checkout via --memorybench-dir), spawns one bun run src/index.ts run -p memento -b <bench> per requested benchmark, and writes a single summary markdown to bench/<ts>.md (the bench/ directory is git-ignored). Defaults to LoCoMo + LongMemEval with sonnet-4.6 pinned for judge + answering + distillation — the model class that actually shows up on the conversation side in real Memento usage (Claude Code, Cursor, and Claude Desktop are the MCP-using-client majority, and extract_memory distillation happens in that same assistant). Sonnet 4.6 supports temperature=0 (deterministic at the model layer) and the alias is registered in the fork's MODEL_CONFIGS. Top-K=30, 180s indexing deadline. Per-phase concurrency flags pass through to memorybench so a slow embedder or a throttled Anthropic endpoint can be tamed (--concurrency-ingest=1 is the safe knob for sonnet-4.6 under bursty rate-limit pressure, and it also lets the provider's per-session distillation cache hit when questions share sessions). The driver spawns the locally built CLI via process.execPath and asserts better-sqlite3 loads under that exact Node before doing any expensive work, so a nvm + homebrew PATH cocktail can't crash the run with a confusing "MCP error -32000: Connection closed". A --resume=<runId> flag picks up at the failed phase of the failed question for a crashed run (memorybench's orchestrator checkpoints after every phase boundary); the runId is logged on a dedicated line in the bench log and reprinted as a copy-pasteable command on any non-zero exit. --out anchors to the Memento repo root regardless of cwd so running from inside the fork checkout doesn't leak the output directory into the fork worktree. The provider implementation itself lives in a fork of supermemoryai/memorybench (veerps57/memorybench@add-memento-provider).

  • extract_memory tool surface + distillation craft. The MCP tool description on extract_memory states the candidate-shape difference from write_memory (flat kind enum, top-level rationale/language), the topic: value\n\nprose requirement for preference/decision kinds, the storedConfidence: 0.8 async-default, and the receipt-not-failure semantics of mode: "async". The inline example exercises four kinds with the correct field placement, including a preference candidate that opens with the required topic-line and a decision candidate with top-level rationale. TagSchema carries a custom error message listing the allowed charset, so an invalid tag produces an actionable diagnostic instead of a bare "Invalid". The skill (skills/memento/SKILL.md), persona-snippet guide (docs/guides/teach-your-assistant.md), and the landing-page persona-snippet mirror (packages/landing/src/App.tsx) carry a "Distillation craft" section that frames the task as retrieval indexing for unknown future queries (not summarisation for a reader) and codifies six rules in priority order: (1) preserve specific terms (proper nouns, identity qualifiers, named entities, places, and the specific object of every action); (2) capture facts about every named participant, not only the user: a friend the user mentions ("my friend Alex is moving to Berlin for a SAP job") or a co-speaker in a multi-party transcript both deserve candidates attributed to the right named person, not collapsed onto the user; (3) emit a candidate for every dated event with the date resolved against the session anchor, never collapsing it to an untimed habit; (4) capture precursor actions alongside outcomes: "researched X then chose Y" emits two candidates, since future questions can target either step; (5) don't squash enumerations into category labels; (6) bias toward inclusion: the server dedups via embedding similarity, so over-including is cheap and under-including is permanent. A pre-emit self-check ("did every date, named entity, and verb-with-specific-object map to a candidate?") sits alongside the rules in each surface.

  • Porter stemming for FTS5 — migration 0008 + honoured config + default flip. memories_fts is now built with tokenize='porter unicode61' instead of the default unicode61. The chain runs right-to-left: unicode61 splits + diacritic-folds first (so non-ASCII content still tokenises correctly — German umlauts, French diacritics, Japanese katakana all survive intact, covered by tests), then porter stems the resulting ASCII tokens. "colleague", "colleagues", and "colleague's" share a stem and match each other; "bake" matches "baking" / "baked" / "bakes"; "research" matches "researched" / "researches"; "agency" matches "agencies". The retrieval.fts.tokenizer config key now defaults to porter and is documented as honoured by the FTS index (it was previously declared but ignored by the migration — dead-code tunability that this change makes real). Migration 0008 drops and rebuilds memories_fts with the new tokenizer, preserving stable rowids via the memories_fts_map table; the runner applies it on first server start after upgrade, so no operator action is required. Six new unit tests cover stem-variant matching (plural/singular, verb-form pairs), pre-migration re-indexing, the insert/update/delete triggers carrying the new tokenizer through write-path operations, and non-ASCII preservation. The trade-off accepted is porter's known over-stems (organize/organic, universe/university); for Memento's dominant query distribution — assistants asking about durable user state in natural language — recall on stem variants is worth more than precision on these edge cases. Operators who need the older behaviour can author a follow-up migration; the config key documents the option.

  • Embedder perf — real batched feature-extraction in @psraghuveer/memento-embedder-local. embedBatch now uses transformers.js v3's array-input pipeline, which runs one forward pass for the whole batch instead of looping per text. Numerically identical to the single-call form (verified row-by-row against the same input). Measured ~1.8× speedup on a 3-input batch with bge-base-en-v1.5 on CPU; the speedup grows with batch size because tokenisation and pipeline setup amortise across the batch. The loader contract now returns { embed, embedBatch? } instead of a bare embed function; loaders that omit embedBatch fall back to the previous sequential behaviour, so test fixtures and bespoke implementations keep working unchanged. Seven new unit tests cover the fast path, the sequential fallback, empty-input short-circuit, runtime-row-count mismatch, per-row dimension validation, batched maxInputBytes truncation, and whole-batch timeout. The EmbeddingProvider.embedBatch surface in @psraghuveer/memento-core is unchanged and remains optional; existing call sites that go through embedBatchFallback (pack.install, import, embedding.rebuild, the synchronous extract slow-path) automatically pick up the fast path.
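
The fast-path and fallback behaviour described in the embedder bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's code: the `EmbedRuntime` field signatures, the `embedAll` wrapper name, and the toy runtimes are hypothetical, and truncation, timeouts, and dimension validation are left out.

```typescript
type Vector = number[];

// Hypothetical runtime shape: `embedBatch` is optional, mirroring the
// `{ embed, embedBatch? }` loader contract described above.
interface EmbedRuntime {
  embed(text: string): Promise<Vector>;
  embedBatch?(texts: string[]): Promise<Vector[]>;
}

async function embedAll(runtime: EmbedRuntime, texts: string[]): Promise<Vector[]> {
  // Empty input short-circuits without touching the runtime.
  if (texts.length === 0) return [];

  if (runtime.embedBatch) {
    // Fast path: one call for the whole batch.
    const rows = await runtime.embedBatch(texts);
    if (rows.length !== texts.length) {
      throw new Error(`expected ${texts.length} rows, got ${rows.length}`);
    }
    return rows;
  }

  // Sequential fallback for runtimes that only provide `embed`.
  const rows: Vector[] = [];
  for (const text of texts) {
    rows.push(await runtime.embed(text));
  }
  return rows;
}

// Toy runtimes for illustration only: the "embedding" is just [text length].
const sequentialOnly: EmbedRuntime = {
  embed: async (text) => [text.length],
};
const batched: EmbedRuntime = {
  embed: async (text) => [text.length],
  embedBatch: async (texts) => texts.map((text) => [text.length]),
};

embedAll(batched, ["a", "bcd"]).then((rows) => console.log(rows));
```

Keeping `embedBatch` optional means a runtime that only implements `embed` still satisfies the interface, which is the compatibility property the bullet describes for test fixtures and bespoke loaders.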

Justification against the four principles

  • First principles. The benchmark driver introduces no new behavioural constants in Memento itself — every knob is a CLI flag or env var declared in DEFAULTS at the top of bench.mjs. The tool-description changes surface constraints that already existed in code (schema validation, the conflict-detector's topic-line parsing) where an assistant reading tools/list will see them. The FTS-tokenizer change does flip a default (retrieval.fts.tokenizer: unicode61 → porter), but the migration that effects it is the canonical Memento way of evolving stored state — and the config key that controls it has shipped since the registry release and is now actually honoured. The embedder change is purely a perf path; semantics are byte-identical to the previous behaviour.
  • Modular. scripts/bench.mjs is a thin driver — the provider lives in a fork of memorybench, the harness is memorybench's, and the judge/answering models are memorybench's. The Memento side is one script + one guide + a pinned fork ref. The distillation-clarity changes touch documentation and a single Zod error message; no behavioural code paths are added. The FTS change is a single migration file + one config-key default flip. The embedder change extends the existing EmbedRuntime shape with an optional embedBatch and adds a single fast-path branch in the wrapper; the rest is sequential-fallback compatibility.
  • Extensible. Adding a third benchmark (ConvoMem) is a one-line change to DEFAULTS.benchmarks. Adding a different judge family means pointing --judge at a different model alias; the script's API-key check fans out by family. The skill's distillation-craft section is positioned so a future contributor can extend the rules without restructuring. The FTS migration's pattern (drop → rebuild → repopulate → retrigger) is the same as 0005's — future tokenizer changes follow the same template. The loader contract's optional embedBatch lets bespoke embedders opt into batching when their runtime supports it, without forcing a contract upgrade on the others.
  • Config-driven. Every benchmark default (model, ref, limit, concurrency, search-K, indexing deadline) is overridable from the command line or env. The FTS-tokenizer choice is retrieval.fts.tokenizer — operators can stay on unicode61 by setting it before first server start and recreating the FTS table via a follow-up migration. The embedder change adds no new config key (the runtime contract change is internal); operators with custom loaders are unaffected by default.
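
As a rough illustration of the "every knob is a flag declared in DEFAULTS" idea, the sketch below merges command-line overrides onto a defaults object and builds the per-benchmark argument list. The default values (sonnet-4.6, K=30, 180s) and the `--judge` and `--concurrency-ingest` flags come from this description; the key names, the lowercase benchmark identifiers, the helper names, and the assumption that memorybench accepts the pass-through flag in the same spelling are hypothetical and do not reflect the real `scripts/bench.mjs`.

```typescript
// Hypothetical option shape; key names are assumptions.
interface BenchOptions {
  benchmarks: string[];
  judge: string;
  searchK: number;
  indexingDeadlineSec: number;
  concurrencyIngest?: number;
}

const DEFAULTS: BenchOptions = {
  benchmarks: ["locomo", "longmemeval"], // hypothetical identifiers
  judge: "sonnet-4.6",
  searchK: 30,
  indexingDeadlineSec: 180,
};

// Apply `--name=value` overrides on top of a copy of DEFAULTS.
function resolveOptions(argv: string[]): BenchOptions {
  const options: BenchOptions = { ...DEFAULTS, benchmarks: [...DEFAULTS.benchmarks] };
  for (const arg of argv) {
    if (!arg.startsWith("--")) continue;
    const eq = arg.indexOf("=");
    if (eq === -1) continue;
    const name = arg.slice(2, eq);
    const value = arg.slice(eq + 1);
    if (name === "judge") options.judge = value;
    if (name === "concurrency-ingest") options.concurrencyIngest = Number(value);
  }
  return options;
}

// Arguments for one `bun run src/index.ts run -p memento -b <bench>` spawn.
function memorybenchArgs(bench: string, options: BenchOptions): string[] {
  const args = ["run", "src/index.ts", "run", "-p", "memento", "-b", bench];
  if (options.concurrencyIngest !== undefined) {
    // Assumed pass-through spelling.
    args.push(`--concurrency-ingest=${options.concurrencyIngest}`);
  }
  return args;
}

const options = resolveOptions(["--concurrency-ingest=1"]);
const commands = options.benchmarks.map((bench) =>
  ["bun", ...memorybenchArgs(bench, options)].join(" "),
);
console.log(commands);
```

With this layout, adding a benchmark is a one-element change to `DEFAULTS.benchmarks`, and nothing else in the sketch needs to know how many benchmarks exist.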

Alternatives considered

  • Vendor memorybench inside the Memento repo. Rejected: keeping the harness external means we don't own its release cadence, and it lets the provider land upstream as a normal contribution. The driver pulls a pinned fork ref, so reproduction is exact.
  • Add LLM-driven distillation inside extract_memory itself. Rejected: Memento's architectural commitment is local-first and LLM-agnostic. Baking in an LLM would either pull in a cloud provider (breaking local-first) or ship a bundled local model (breaking LLM-agnostic and adding ops complexity). Distillation belongs to the calling AI assistant, where the conversation context lives. The bench provider does its own distill step to mirror that flow.
  • Re-design the candidate shape so write_memory and extract_memory accept the same payload. Considered, rejected: the discriminated-union shape on write_memory is the right design for a single-row call where kind-specific metadata is the point; the flat shape on extract_memory is the right design for a batch where the per-item type is data, not a routing tag. Documenting the difference is correct; collapsing them would weaken both APIs.
  • Keep unicode61 as the FTS default and ship porter as an opt-in only. Rejected: the retrieval.fts.tokenizer config key was already documented as the operator-tunable knob, and validation of the porter path on a real bench question showed unicode61 missing the gold-truth memory at the FTS layer entirely. The migration is the right place to flip the default because anyone who actively wants unicode61 can author a follow-up migration; the silent majority who never touched the key get a measurable recall improvement.
  • Heavier embedder optimisations — quantisation (dtype: 'q8'), worker thread, WebGPU. Deferred: quantisation is a recall trade-off that needs its own evaluation pass; worker threads improve event-loop responsiveness without raising throughput on a single CPU; WebGPU only helps browser hosts (Memento runs on Node). Real batched feature-extraction is the largest no-trade-off win available today, so it's the one shipped here.
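
To make the flat extract_memory candidate shape and the topic-line requirement concrete, here is a minimal sketch of a client-side pre-flight check. The `content` field name, the regular expression, the helper name, and the sample candidates are assumptions for illustration; Memento's own schema and conflict-detector parsing may differ.

```typescript
// Hypothetical flat candidate: `kind` is plain data and `rationale` /
// `language` sit at the top level. The `content` field name is assumed.
interface Candidate {
  kind: string;
  content: string;
  rationale?: string;
  language?: string;
}

// Kinds that the tool description says must open with a topic line.
const TOPIC_LINE_KINDS = new Set(["preference", "decision"]);

// Assumed pattern: "topic: value", then a blank line, then prose.
const TOPIC_LINE = /^[^:\n]+: [^\n]+\n\n\S/;

// Return one human-readable problem per candidate that lacks a topic line.
function topicLineProblems(candidates: Candidate[]): string[] {
  const problems: string[] = [];
  candidates.forEach((candidate, index) => {
    if (TOPIC_LINE_KINDS.has(candidate.kind) && !TOPIC_LINE.test(candidate.content)) {
      problems.push(
        `candidates[${index}]: ${candidate.kind} content must start with "topic: value" followed by a blank line`,
      );
    }
  });
  return problems;
}

const candidates: Candidate[] = [
  {
    kind: "preference",
    content: "editor: neovim\n\nThe user prefers Neovim for all code editing.",
  },
  {
    // Missing topic line: this one should be reported.
    kind: "decision",
    content: "We will use SQLite.",
    rationale: "Local-first storage.",
  },
];

const problems = topicLineProblems(candidates);
console.log(problems);
```

Because the batch shape is flat, the check can treat `kind` as ordinary data and loop over candidates uniformly, which is the property the third alternative above argues for keeping.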

Tests

  • Unit — full unit suite passes on this branch, plus 13 new tests (six for migration 0008 covering stem variants, pre-migration re-indexing, triggers, and non-ASCII preservation; seven for the embedder fast-path and sequential-fallback paths).
  • Integration — N/A; no new integration paths added beyond the existing extract path which is already integration-tested.
  • Migration — 0008_fts_porter_tokenizer is forward-only, idempotent on a fresh DB, and verified end-to-end against a pre-0008 install via MIGRATIONS.slice(0, 7) in the test suite.
  • End-to-end — the existing serve e2e passes. The bench itself is the new end-to-end exercise but is not part of pnpm verify for the reasons documented in docs/guides/benchmark.md (it needs network, judge API keys, and hours of wall-clock, while CI must pass offline). A focused one-question LongMemEval validation against the baking question confirmed the porter fix lifts that question from 0 → 1 correct, with the lemon-poppyseed memory ranking #4 in retrieval where previously it didn't reach top-30.
  • N/A — see above.

Local verification

  • pnpm verify (lint → typecheck → build → test → test:e2e → docs:lint → docs:reflow:check → docs:links → docs:check → format:packs:check → server-json:check) — all green at branch HEAD.
  • pnpm docs:generate — run; docs/reference/{cli,mcp-tools,config-keys}.md, AGENTS.md, CONTRIBUTING.md, .github/copilot-instructions.md, and .github/PULL_REQUEST_TEMPLATE.md regenerated to pick up the new extract_memory description, the TagSchema error message, and the retrieval.fts.tokenizer default + description.

ADR

  • An ADR is required and is included in this PR.
  • An ADR is required and exists already (link below).
  • No ADR required (explain why):

The bench driver, the tool-description changes, the embedder fast path, and the FTS tokenizer migration are all within the ADR exemption list in AGENTS.md:

  • The bench driver is optional tooling — it adds a script and a guide, doesn't change the public surface, the data model, scope semantics, or any top-level dependency.
  • The extract_memory tool-description and TagSchema error-message changes make existing contracts more discoverable without changing them.
  • The embedder fast path is a perf optimisation with byte-identical output; no semantic change.
  • The FTS tokenizer change is a forward-only migration that honours an already-documented config key (retrieval.fts.tokenizer). The default flip is operator-visible but it neither introduces a new behavioural constant nor changes a contract — it activates a knob that already shipped. Memento's stance on tokenizer choice was always "operator-configurable, default may evolve as the use case sharpens" (per the config key's description).

AI involvement

  • No AI assistance.
  • AI assistance for boilerplate / drafting only.
  • AI authored substantial portions. I have verified every line.

The bench driver, the provider in the memorybench fork, the audit of extract_memory's distillation-friction surface, the porter migration + tests, the embedder batching + tests, and the prose updates to the skill / persona guide / landing snippet were drafted with Claude. Every change was reviewed and exercised end-to-end against LoCoMo and LongMemEval smokes through the full pipeline (distill → write → indexing → search → answer → judge). The Zod error-message change and the tool-description text were verified against the actual code paths they describe. The porter fix specifically was validated by re-running the same failed bench question against the new code and confirming the gold-truth memory now ranks at the top of the retrieved set with the same models, same haystack, same scope.

Linked issues

Corresponding memorybench PR: supermemoryai/memorybench#43

veerps57 force-pushed the feat/memorybench-and-distillation branch from 6981a0c to c0ff2ee on May 15, 2026 05:36
… FTS recall + batched embedder)

Four coordinated changes that improve Memento for AI assistants doing
memory work and ship a reusable public-dataset bench.

**Bench driver: `scripts/bench.mjs` + `docs/guides/benchmark.md`.**

Vanilla-Node ESM driver that builds Memento, stages a memorybench
fork at a pinned ref (or a local checkout via `--memorybench-dir`),
spawns one `bun run src/index.ts run -p memento -b <bench>` per
requested benchmark, and writes a single summary markdown to
`bench/<ts>.md`. Defaults to LoCoMo + LongMemEval with `sonnet-4.6`
pinned for judge + answering + distillation: the model class real
Memento users have on the conversation side via Claude Code, Cursor,
and Claude Desktop. Top-K=30, 180s indexing deadline. Per-phase
concurrency flags pass through to memorybench (`--concurrency-ingest=1`
is the safe knob against Anthropic's bursty Overloaded 529s and
also lets the provider's per-session distillation cache hit when
questions share sessions). The driver spawns the locally built CLI
via `process.execPath` and asserts `better-sqlite3` loads under
that exact Node before doing any expensive work, so a `nvm + homebrew`
PATH cocktail can't crash the run with a confusing "MCP error
-32000: Connection closed". `--resume=<runId>` picks up at the failed
phase of the failed question for a crashed run; the runId is logged
on a dedicated line and reprinted as a copy-pasteable command on any
non-zero exit. `--out` anchors to the Memento repo root regardless
of `cwd` so running from inside the fork checkout doesn't leak the
output directory into the fork worktree. Not part of `pnpm verify`:
it needs network, judge API keys, and hours of wall-clock. The provider
implementation lives in a fork of memorybench at
veerps57/memorybench@add-memento-provider; the driver pins to that
ref so reproduction is exact.

**Write side: `extract_memory` contract clarity + distillation craft.**

The MCP tool description on `extract_memory` states the
candidate-shape difference from `write_memory` (flat `kind` enum,
top-level `rationale`/`language`), the `topic: value\n\nprose`
requirement for `preference`/`decision` kinds, the
`storedConfidence: 0.8` async-default, and the receipt-not-failure
semantics of `mode: "async"`. An inline example shows four kinds with
the correct field placement, including a `preference` candidate with
the required topic-line and a `decision` candidate with top-level
`rationale`. `TagSchema` emits an actionable error message listing the
allowed charset instead of a bare "Invalid".

The skill (`skills/memento/SKILL.md`), persona-snippet guide
(`docs/guides/teach-your-assistant.md`), and landing-page
persona-snippet mirror (`packages/landing/src/App.tsx`) carry a
"Distillation craft" section that frames the task as retrieval
indexing for unknown future queries (not summarisation for a reader)
and codifies five rules in priority order: preserve specific terms
(proper nouns, identity qualifiers, named entities, places, the
specific object of every action); emit a candidate for every dated
event with the date resolved against the session anchor, never
collapsing it to an untimed habit; capture precursor actions alongside
outcomes ("researched X then chose Y" emits two candidates since
future questions can target either step); don't squash enumerations
into category labels; bias toward inclusion (the server dedups via
embedding similarity, so over-including is cheap and under-including
is permanent). A pre-emit self-check ("did every date, named entity,
and verb-with-specific-object map to a candidate?") sits alongside the
rules in each surface.

**Read side: porter stemming for FTS5.**

`memories_fts` is now built with `tokenize='porter unicode61'` instead
of the default `unicode61`. The chain runs right-to-left: unicode61
splits + diacritic-folds first (so non-ASCII content still tokenises
correctly), then porter stems the resulting ASCII tokens.
"colleague", "colleagues", and "colleague's" share a stem and match
each other; "bake" matches "baking" / "baked" / "bakes"; "research"
matches "researched" / "researches"; "agency" matches "agencies".
The `retrieval.fts.tokenizer` config key now defaults to `porter`
and is documented as honoured by the FTS index (it was previously
declared but ignored by the migration: dead-code tunability that
this change makes real). Migration 0008 drops and rebuilds
`memories_fts` with the new tokenizer, preserving stable rowids via
the `memories_fts_map` table; the runner applies it on first server
start after upgrade, so no operator action is required.

Six new unit tests in `0008_fts_porter_tokenizer.test.ts` cover the
acceptance criteria: stem variants match across plural/singular and
verb-form pairs, pre-migration memories are re-indexed under the new
tokenizer, the insert/update/delete triggers carry the new tokenizer
through write-path operations, and non-ASCII content (German umlauts,
French diacritics, Japanese katakana) survives the chain intact.

The trade-off accepted is porter's known over-stems
(organize/organic, universe/university). For Memento's dominant query
distribution — assistants asking about durable user state in natural
language — recall on stem variants is worth more than precision on
these edge cases. Operators who need the older behaviour can author a
follow-up migration; the config key documents the option.

**Embedder perf: real batched feature-extraction.**

`@psraghuveer/memento-embedder-local`'s `embedBatch` now uses
transformers.js v3's array-input pipeline, which runs one forward
pass for the whole batch instead of looping per text. Numerically
identical to the single-call form (verified row-by-row against the
same input). Measured ~1.8x speedup on a 3-input batch with
`bge-base-en-v1.5` on CPU; the speedup grows with batch size because
tokenisation and pipeline setup amortise across the batch. The
loader contract now returns `{ embed, embedBatch? }` instead of a
bare `embed` function; loaders that omit `embedBatch` fall back to
the previous sequential behaviour, so test fixtures and bespoke
implementations keep working unchanged. Seven new unit tests cover
the fast path, the sequential fallback, empty-input short-circuit,
runtime-row-count mismatch, per-row dimension validation, batched
`maxInputBytes` truncation, and whole-batch timeout. The
`EmbeddingProvider.embedBatch` surface in
`@psraghuveer/memento-core` is unchanged and remains optional;
existing call sites that go through `embedBatchFallback`
automatically pick up the fast path.

**Release note.** Single changeset bumps schema / core /
embedder-local / memento minor: the FTS-tokenizer default flips,
the `extract_memory` tool description shifts, and the embedder
batches in user-observable ways. `memento-landing` is private
(marketing site, not published to npm) and is added to the changesets
`ignore` list so it can't accidentally land in any future release
note. The `docs/reference/{cli,mcp-tools,config-keys}.md` files are
regenerated to reflect the new tool description, default, and error
message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
veerps57 force-pushed the feat/memorybench-and-distillation branch from c0ff2ee to dbcf019 on May 15, 2026 05:37
veerps57 merged commit 0dc4716 into main on May 15, 2026
14 checks passed
veerps57 deleted the feat/memorybench-and-distillation branch on May 15, 2026 05:47