feat: memorybench harness + AI-assistant memory craft #62
Merged
… FTS recall + batched embedder)
Four coordinated changes that improve Memento for AI assistants doing
memory work and ship a reusable public-dataset bench.
**Bench driver — \`scripts/bench.mjs\` + \`docs/guides/benchmark.md\`.**
Vanilla-Node ESM driver that builds Memento, stages a memorybench
fork at a pinned ref (or a local checkout via \`--memorybench-dir\`),
spawns one \`bun run src/index.ts run -p memento -b <bench>\` per
requested benchmark, and writes a single summary markdown to
\`bench/<ts>.md\`. Defaults to LoCoMo + LongMemEval with \`sonnet-4.6\`
pinned for judge + answering + distillation — the model class real
Memento users have on the conversation side via Claude Code, Cursor,
and Claude Desktop. Top-K=30, 180s indexing deadline. Per-phase
concurrency flags pass through to memorybench (\`--concurrency-ingest=1\`
is the safe knob against Anthropic's bursty Overloaded 529s and
also lets the provider's per-session distillation cache hit when
questions share sessions). The driver spawns the locally built CLI
via \`process.execPath\` and asserts \`better-sqlite3\` loads under
that exact Node before doing any expensive work, so a \`nvm + homebrew\`
PATH cocktail can't crash the run with a confusing "MCP error
-32000: Connection closed". \`--resume=<runId>\` picks up at the failed
phase of the failed question for a crashed run; the runId is logged
on a dedicated line and reprinted as a copy-pasteable command on any
non-zero exit. \`--out\` anchors to the Memento repo root regardless
of \`cwd\` so running from inside the fork checkout doesn't leak the
output directory into the fork worktree. Not part of \`pnpm verify\` —
needs network, judge API keys, and hours of wall-clock. The provider
implementation lives in a fork of memorybench at
veerps57/memorybench@add-memento-provider; the driver pins to that
ref so reproduction is exact.
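The preflight described above (checking that `better-sqlite3` loads under the same Node binary that will run the CLI) could look roughly like the sketch below. The helper names `canLoadUnder` and `assertNativeAddonLoads` are hypothetical and are not taken from `scripts/bench.mjs`; only `process.execPath` and the `better-sqlite3` module name come from the description.

```typescript
import { spawnSync } from "node:child_process";

// Returns true when `moduleName` can be required by the given Node binary.
// Hypothetical helper: the real driver may probe the addon differently.
function canLoadUnder(nodeBinary: string, moduleName: string): boolean {
  const probe = spawnSync(
    nodeBinary,
    ["-e", `require(${JSON.stringify(moduleName)})`],
    { stdio: "ignore" },
  );
  // A failed require throws in the child, which exits with a non-zero status.
  return probe.status === 0;
}

// Fail fast, before any build or network work, if the native addon is unusable.
function assertNativeAddonLoads(moduleName = "better-sqlite3"): void {
  if (!canLoadUnder(process.execPath, moduleName)) {
    throw new Error(
      `${moduleName} does not load under ${process.execPath}; ` +
        "rebuild it for this Node version before running the bench.",
    );
  }
}
```

Because the child resolves the module relative to its working directory, a real driver would presumably run the probe from the package that depends on the addon.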
**Write side — \`extract_memory\` contract clarity + distillation craft.**
The MCP tool description on \`extract_memory\` states the candidate-
shape difference from \`write_memory\` (flat \`kind\` enum, top-level
\`rationale\`/\`language\`), the \`topic: value\n\nprose\` requirement
for \`preference\`/\`decision\` kinds, the \`storedConfidence: 0.8\`
async-default, and the receipt-not-failure semantics of \`mode: "async"\`.
An inline example shows four kinds with the correct field placement,
including a \`preference\` candidate with the required topic-line and
a \`decision\` candidate with top-level \`rationale\`. \`TagSchema\` emits
an actionable error message listing the allowed charset instead of
a bare "Invalid".
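As an illustration of the candidate shape described above, here is a minimal sketch. Only `kind`, `rationale`, `language`, `mode`, and the `topic: value\n\nprose` layout come from the description; the `content` and `candidates` field names, the helper `hasTopicLine`, and the example values are assumptions made for this sketch, so the generated tool reference remains the authority on the real schema.

```typescript
// Assumed shapes: `content` and `candidates` are hypothetical field names.
type CandidateKind = "preference" | "decision"; // the real enum has more kinds

interface Candidate {
  kind: CandidateKind; // flat enum value, not a discriminated union
  content: string;
  rationale?: string; // top-level, next to `kind`
  language?: string; // top-level, next to `kind`
}

// Matches "topic: value", then a blank line, then at least one character of prose.
const TOPIC_LINE = /^[^:\n]+: [^\n]+\n\n[\s\S]+$/;

function hasTopicLine(candidate: Candidate): boolean {
  return TOPIC_LINE.test(candidate.content);
}

// Illustrative values only.
const candidates: Candidate[] = [
  {
    kind: "preference",
    content: "editor: neovim\n\nPrefers Neovim over VS Code for daily work.",
  },
  {
    kind: "decision",
    content: "database: sqlite\n\nChose SQLite for the local store.",
    rationale: "Single-file deployment with no server to operate.",
  },
];

// With mode "async" the response is a receipt, not a failure.
const request = { mode: "async" as const, candidates };
```

Running `hasTopicLine` over every `preference` and `decision` candidate before the call is one cheap way for a caller to catch a missing topic line locally.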
The skill (\`skills/memento/SKILL.md\`), persona-snippet guide
(\`docs/guides/teach-your-assistant.md\`), and landing-page persona-
snippet mirror (\`packages/landing/src/App.tsx\`) carry a "Distillation
craft" section that frames the task as retrieval indexing for unknown
future queries (not summarisation for a reader) and codifies five
rules in priority order: preserve specific terms (proper nouns,
identity qualifiers, named entities, places, the specific object of
every action); emit a candidate for every dated event with the date
resolved against the session anchor, never collapsing it to an untimed
habit; capture precursor actions alongside outcomes ("researched X
then chose Y" emits two candidates since future questions can target
either step); don't squash enumerations into category labels; bias
toward inclusion (the server dedups via embedding similarity, so
over-including is cheap and under-including is permanent). A pre-emit
self-check ("did every date, named entity, and verb-with-specific-
object map to a candidate?") sits alongside the rules in each surface.
**Read side — porter stemming for FTS5.**
\`memories_fts\` is now built with \`tokenize='porter unicode61'\` instead
of the default \`unicode61\`. The chain runs right-to-left: unicode61
splits + diacritic-folds first (so non-ASCII content still tokenises
correctly), then porter stems the resulting ASCII tokens.
"colleague", "colleagues", and "colleague's" share a stem and match
each other; "bake" matches "baking" / "baked" / "bakes"; "research"
matches "researched" / "researches"; "agency" matches "agencies".
The \`retrieval.fts.tokenizer\` config key now defaults to \`porter\`
and is documented as honoured by the FTS index (it was previously
declared but ignored by the migration — dead-code tunability that
this change makes real). Migration 0008 drops and rebuilds
\`memories_fts\` with the new tokenizer, preserving stable rowids via
the \`memories_fts_map\` table; the runner applies it on first server
start after upgrade, so no operator action is required.
Six new unit tests in \`0008_fts_porter_tokenizer.test.ts\` cover the
acceptance criteria: stem variants match across plural/singular and
verb-form pairs, pre-migration memories are re-indexed under the new
tokenizer, the insert/update/delete triggers carry the new tokenizer
through write-path operations, and non-ASCII content (German umlauts,
French diacritics, Japanese katakana) survives the chain intact.
The trade-off accepted is porter's known over-stems
(organize/organic, universe/university). For Memento's dominant query
distribution — assistants asking about durable user state in natural
language — recall on stem variants is worth more than precision on
these edge cases. Operators who need the older behaviour can author a
follow-up migration; the config key documents the option.
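To make the tokenizer switch concrete, the sketch below shows how a config value could be mapped to the FTS5 `tokenize` argument when the index is rebuilt. The `tokenize='porter unicode61'` string and the `memories_fts` table name come from the description above; the function names and the single `content` column are assumptions, since the real column list is not shown here.

```typescript
type FtsTokenizer = "porter" | "unicode61";

// Maps the config value to an FTS5 tokenize argument. `porter` is a wrapper
// tokenizer, so it is listed first and unicode61 runs underneath it.
function tokenizeClause(tokenizer: FtsTokenizer): string {
  return tokenizer === "porter" ? "porter unicode61" : "unicode61";
}

// Builds the DDL for the rebuilt index. The single `content` column is an
// assumption made for this sketch, not the real memories_fts column list.
function buildFtsDdl(tokenizer: FtsTokenizer): string {
  return (
    "CREATE VIRTUAL TABLE memories_fts USING fts5(" +
    `content, tokenize='${tokenizeClause(tokenizer)}');`
  );
}
```

A follow-up migration that restores the older behaviour would, under the same assumptions, rebuild the table with `buildFtsDdl("unicode61")`.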
**Embedder perf — real batched feature-extraction.**
\`@psraghuveer/memento-embedder-local\`'s \`embedBatch\` now uses
transformers.js v3's array-input pipeline, which runs one forward
pass for the whole batch instead of looping per text. Numerically
identical to the single-call form (verified row-by-row against the
same input). Measured ~1.8x speedup on a 3-input batch with
\`bge-base-en-v1.5\` on CPU; the speedup grows with batch size because
tokenisation and pipeline setup amortise across the batch. The
loader contract now returns \`{ embed, embedBatch? }\` instead of a
bare \`embed\` function; loaders that omit \`embedBatch\` fall back to
the previous sequential behaviour, so test fixtures and bespoke
implementations keep working unchanged. Seven new unit tests cover
the fast path, the sequential fallback, empty-input short-circuit,
runtime-row-count mismatch, per-row dimension validation, batched
\`maxInputBytes\` truncation, and whole-batch timeout. The
\`EmbeddingProvider.embedBatch\` surface in
\`@psraghuveer/memento-core\` is unchanged and remains optional;
existing call sites that go through \`embedBatchFallback\`
automatically pick up the fast path.
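The optional-`embedBatch` contract and its sequential fallback can be sketched as follows. The `{ embed, embedBatch? }` shape, the empty-input short-circuit, and the row-count check come from the description above; the wrapper name `embedAll`, the `Vector` alias, and the error wording are assumptions, and the sketch omits truncation, dimension validation, and timeouts.

```typescript
type Vector = number[];

// Loader result as described above: `embedBatch` is optional.
interface EmbedRuntime {
  embed: (text: string) => Promise<Vector>;
  embedBatch?: (texts: string[]) => Promise<Vector[]>;
}

// `embedAll` is a hypothetical wrapper name used only for this sketch.
async function embedAll(runtime: EmbedRuntime, texts: string[]): Promise<Vector[]> {
  if (texts.length === 0) return []; // empty-input short-circuit

  if (runtime.embedBatch) {
    // Fast path: one call for the whole batch.
    const rows = await runtime.embedBatch(texts);
    if (rows.length !== texts.length) {
      throw new Error(`expected ${texts.length} rows, got ${rows.length}`);
    }
    return rows;
  }

  // Sequential fallback for loaders that only provide `embed`.
  const rows: Vector[] = [];
  for (const text of texts) {
    rows.push(await runtime.embed(text));
  }
  return rows;
}
```

Keeping `embedBatch` optional means a test fixture that only supplies `embed` still satisfies the interface and simply takes the slower branch.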
**Release note.** Single changeset bumps schema / core /
embedder-local / memento minor — the FTS-tokenizer default flips,
the \`extract_memory\` tool description shifts, and the embedder
batches in user-observable ways. \`memento-landing\` is private
(marketing site, not published to npm) and is added to the changesets
\`ignore\` list so it can't accidentally land in any future release
note. \`docs/reference/{cli,mcp-tools,config-keys}.md\` is
regenerated to reflect the new tool description, default, and error
message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
Two needs motivated this work; running against the bench surfaced two more engine gaps that are shipped fixed in the same PR.
- The extract_memory contract had several first-try-wrong footguns for AI assistants doing distillation. While building the benchmark provider, every distillation attempt failed silently because the candidate shape, the topic-line requirement, and the async-mode receipt semantics weren't surfaced in the MCP tool description (some were in the skill but invisible to a tool consumer reading only tools/list). The tool's own example was itself non-validating. Tag-regex failures returned a bare (root): Invalid. This is the kind of friction that turns "give Memento a try" into "give Memento up."
- … d24813b1, "tips on what to bake for colleagues") missed the gold-truth memory because the memory said "colleague's going-away party" and the query said "colleagues": the default unicode61 FTS5 tokenizer treats them as different tokens. The retrieval.fts.tokenizer config key advertised porter as a tunable alternative, but no migration or runtime code ever read it: dead-code tunability. Vector embedding rescued some morphological misses, but not enough, and the failure mode is exactly the one a durable memory layer needs to handle well (the speaker's wording and the future question's wording rarely match in surface form).
- embedder-local.embedBatch was sequential under the hood. The implementation looped extractor(text, ...) per row with a comment pointing at the transformers.js v2 limitation. transformers.js v3 (already pinned via ^3.0.0) accepts an array input and runs one forward pass for the whole batch, verified row-by-row numerically identical to the single-call form.

Change
Four coordinated workstreams ship together. The bench surfaced (3) and (4); the fixes that close them improve Memento for every assistant doing memory work, not just for the bench's score.
1. Bench driver: scripts/bench.mjs + docs/guides/benchmark.md. A vanilla-Node ESM driver that builds Memento, stages a memorybench fork at a pinned ref (or a local checkout via --memorybench-dir), spawns one bun run src/index.ts run -p memento -b <bench> per requested benchmark, and writes a single summary markdown to bench/<ts>.md (the bench/ directory is git-ignored). Defaults to LoCoMo + LongMemEval with sonnet-4.6 pinned for judge + answering + distillation, the model class that actually shows up on the conversation side in real Memento usage (Claude Code, Cursor, and Claude Desktop are the MCP-using-client majority, and extract_memory distillation happens in that same assistant). Sonnet 4.6 supports temperature=0 (deterministic at the model layer) and the alias is registered in the fork's MODEL_CONFIGS. Top-K=30, 180s indexing deadline. Per-phase concurrency flags pass through to memorybench so a slow embedder or a throttled Anthropic endpoint can be tamed (--concurrency-ingest=1 is the safe knob for sonnet-4.6 under bursty rate-limit pressure, and it also lets the provider's per-session distillation cache hit when questions share sessions). The driver spawns the locally built CLI via process.execPath and asserts better-sqlite3 loads under that exact Node before doing any expensive work, so a nvm + homebrew PATH cocktail can't crash the run with a confusing "MCP error -32000: Connection closed". A --resume=<runId> flag picks up at the failed phase of the failed question for a crashed run (memorybench's orchestrator checkpoints after every phase boundary); the runId is logged on a dedicated line in the bench log and reprinted as a copy-pasteable command on any non-zero exit. --out anchors to the Memento repo root regardless of cwd so running from inside the fork checkout doesn't leak the output directory into the fork worktree. The provider implementation itself lives in a fork of supermemoryai/memorybench (veerps57/memorybench@add-memento-provider).

2. extract_memory tool surface + distillation craft. The MCP tool description on extract_memory states the candidate-shape difference from write_memory (flat kind enum, top-level rationale/language), the topic: value\n\nprose requirement for preference/decision kinds, the storedConfidence: 0.8 async-default, and the receipt-not-failure semantics of mode: "async". The inline example exercises four kinds with the correct field placement, including a preference candidate that opens with the required topic-line and a decision candidate with top-level rationale. TagSchema carries a custom error message listing the allowed charset so "April 15, 2026" produces an actionable diagnostic instead of a bare "Invalid". The skill (skills/memento/SKILL.md), persona-snippet guide (docs/guides/teach-your-assistant.md), and the landing-page persona-snippet mirror (packages/landing/src/App.tsx) carry a "Distillation craft" section that frames the task as retrieval indexing for unknown future queries (not summarisation for a reader) and codifies six rules in priority order: (1) preserve specific terms: proper nouns, identity qualifiers, named entities, places, and the specific object of every action; (2) capture facts about every named participant, not only the user: a friend the user mentions ("my friend Alex is moving to Berlin for a SAP job") or a co-speaker in a multi-party transcript both deserve candidates attributed to the right named person, not collapsed onto the user; (3) emit a candidate for every dated event with the date resolved against the session anchor, never collapsing it to an untimed habit; (4) capture precursor actions alongside outcomes: "researched X then chose Y" emits two candidates, since future questions can target either step; (5) don't squash enumerations into category labels; (6) bias toward inclusion: the server dedups via embedding similarity, so over-including is cheap and under-including is permanent. A pre-emit self-check ("did every date, named entity, and verb-with-specific-object map to a candidate?") sits alongside the rules in each surface.

3. Porter stemming for FTS5: migration 0008 + honoured config + default flip. memories_fts is now built with tokenize='porter unicode61' instead of the default unicode61. The chain runs right-to-left: unicode61 splits + diacritic-folds first (so non-ASCII content still tokenises correctly; German umlauts, French diacritics, and Japanese katakana all survive intact, covered by tests), then porter stems the resulting ASCII tokens. "colleague", "colleagues", and "colleague's" share a stem and match each other; "bake" matches "baking" / "baked" / "bakes"; "research" matches "researched" / "researches"; "agency" matches "agencies". The retrieval.fts.tokenizer config key now defaults to porter and is documented as honoured by the FTS index (it was previously declared but ignored by the migration, dead-code tunability that this change makes real). Migration 0008 drops and rebuilds memories_fts with the new tokenizer, preserving stable rowids via the memories_fts_map table; the runner applies it on first server start after upgrade, so no operator action is required. Six new unit tests cover stem-variant matching (plural/singular, verb-form pairs), pre-migration re-indexing, the insert/update/delete triggers carrying the new tokenizer through write-path operations, and non-ASCII preservation. The trade-off accepted is porter's known over-stems (organize/organic, universe/university); for Memento's dominant query distribution (assistants asking about durable user state in natural language) recall on stem variants is worth more than precision on these edge cases. Operators who need the older behaviour can author a follow-up migration; the config key documents the option.

4. Embedder perf: real batched feature-extraction in @psraghuveer/memento-embedder-local. embedBatch now uses transformers.js v3's array-input pipeline, which runs one forward pass for the whole batch instead of looping per text. Numerically identical to the single-call form (verified row-by-row against the same input). Measured ~1.8× speedup on a 3-input batch with bge-base-en-v1.5 on CPU; the speedup grows with batch size because tokenisation and pipeline setup amortise across the batch. The loader contract now returns { embed, embedBatch? } instead of a bare embed function; loaders that omit embedBatch fall back to the previous sequential behaviour, so test fixtures and bespoke implementations keep working unchanged. Seven new unit tests cover the fast path, the sequential fallback, empty-input short-circuit, runtime-row-count mismatch, per-row dimension validation, batched maxInputBytes truncation, and whole-batch timeout. The EmbeddingProvider.embedBatch surface in @psraghuveer/memento-core is unchanged and remains optional; existing call sites that go through embedBatchFallback (pack.install, import, embedding.rebuild, the synchronous extract slow-path) automatically pick up the fast path.

Justification against the four principles
- … DEFAULTS at the top of bench.mjs. The tool-description changes surface constraints that already existed in code (schema validation, the conflict-detector's topic-line parsing) where an assistant reading tools/list will see them. The FTS-tokenizer change does flip a default (retrieval.fts.tokenizer: unicode61 → porter), but the migration that effects it is the canonical Memento way of evolving stored state, and the config key that controls it has shipped since the registry release and is now actually honoured. The embedder change is purely a perf path; semantics are byte-identical to the previous behaviour.
- scripts/bench.mjs is a thin driver: the provider lives in a fork of memorybench, the harness is memorybench's, and the judge/answering models are memorybench's. The Memento side is one script + one guide + a pinned fork ref. The distillation-clarity changes touch documentation and a single Zod error message; no behavioural code paths are added. The FTS change is a single migration file + one config-key default flip. The embedder change extends the existing EmbedRuntime shape with an optional embedBatch and adds a single fast-path branch in the wrapper; the rest is sequential-fallback compatibility.
- … DEFAULTS.benchmarks. Adding a different judge family means pointing --judge at a different model alias; the script's API-key check fans out by family. The skill's distillation-craft section is positioned so a future contributor can extend the rules without restructuring. The FTS migration's pattern (drop → rebuild → repopulate → retrigger) is the same as 0005's, so future tokenizer changes follow the same template. The loader contract's optional embedBatch lets bespoke embedders opt into batching when their runtime supports it, without forcing a contract upgrade on the others.
- … retrieval.fts.tokenizer: operators can stay on unicode61 by setting it before first server start and recreating the FTS table via a follow-up migration. The embedder change adds no new config key (the runtime contract change is internal); operators with custom loaders are unaffected by default.

Alternatives considered
- … extract_memory itself. Rejected: Memento's architectural commitment is local-first and LLM-agnostic. Baking in an LLM would either pull in a cloud provider (breaking local-first) or ship a bundled local model (breaking LLM-agnostic and adding ops complexity). Distillation belongs to the calling AI assistant, where the conversation context lives. The bench provider does its own distill step to mirror that flow.
- … write_memory and extract_memory accept the same payload. Considered, rejected: the discriminated-union shape on write_memory is the right design for a single-row call where kind-specific metadata is the point; the flat shape on extract_memory is the right design for a batch where the per-item type is data, not a routing tag. Documenting the difference is correct; collapsing them would weaken both APIs.
- … unicode61 as the FTS default and ship porter as an opt-in only. Rejected: the retrieval.fts.tokenizer config key was already documented as the operator-tunable knob, and validation of the porter path on a real bench question showed unicode61 missing the gold-truth memory at the FTS layer entirely. The migration is the right place to flip the default because anyone who actively wants unicode61 can author a follow-up migration; the silent majority who never touched the key get a measurable recall improvement.
- … dtype: 'q8'), worker thread, WebGPU. Deferred: quantisation is a recall trade-off that needs its own evaluation pass; worker threads improve event-loop responsiveness without raising throughput on a single CPU; WebGPU only helps browser hosts (Memento runs on Node). Real batched feature-extraction is the largest no-trade-off win available today, so it's the one shipped here.

Tests
- 0008_fts_porter_tokenizer is forward-only, idempotent on a fresh DB, and verified end-to-end against a pre-0008 install via MIGRATIONS.slice(0, 7) in the test suite.
- serve e2e passes. The bench itself is the new end-to-end exercise but is not part of pnpm verify for the reasons documented in docs/guides/benchmark.md (it needs network, judge API keys, and hours of wall-clock; CI must pass offline). A focused 1Q LongMemEval validation against the baking question confirmed the porter fix lifts that question from 0 → 1 correct, with the lemon-poppyseed memory ranking #4 in retrieval where previously it didn't reach top-30.

Local verification
- pnpm verify (lint → typecheck → build → test → test:e2e → docs:lint → docs:reflow:check → docs:links → docs:check → format:packs:check → server-json:check): all green at branch HEAD.
- pnpm docs:generate: run; docs/reference/{cli,mcp-tools,config-keys}.md, AGENTS.md, CONTRIBUTING.md, .github/copilot-instructions.md, and .github/PULL_REQUEST_TEMPLATE.md regenerated to pick up the new extract_memory description, the TagSchema error message, and the retrieval.fts.tokenizer default + description.

ADR
The bench driver, the tool-description changes, the embedder fast path, and the FTS tokenizer migration are all within the ADR exemption list in AGENTS.md:
- extract_memory tool-description and TagSchema error-message changes make existing contracts more discoverable without changing them.
- … retrieval.fts.tokenizer). The default flip is operator-visible but it neither introduces a new behavioural constant nor changes a contract; it activates a knob that already shipped. Memento's stance on tokenizer choice was always "operator-configurable, default may evolve as the use case sharpens" (per the config key's description).

AI involvement
The bench driver, the provider in the memorybench fork, the audit of extract_memory's distillation-friction surface, the porter migration + tests, the embedder batching + tests, and the prose updates to the skill / persona guide / landing snippet were drafted with Claude. Every change was reviewed and exercised end-to-end against LoCoMo and LongMemEval smokes through the full pipeline (distill → write → indexing → search → answer → judge). The Zod error-message change and the tool-description text were verified against the actual code paths they describe. The porter fix specifically was validated by re-running the same failed bench question against the new code and confirming the gold-truth memory now ranks at the top of the retrieved set with the same models, same haystack, same scope.

Linked issues
Corresponding memorybench PR: supermemoryai/memorybench#43