From dbcf0198c1a87cbbfed35e73c312afcfef8607b4 Mon Sep 17 00:00:00 2001 From: Raghu Date: Thu, 14 May 2026 20:58:12 +0200 Subject: [PATCH] feat: memorybench harness + AI-assistant memory craft (distillation + FTS recall + batched embedder) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four coordinated changes that improve Memento for AI assistants doing memory work and ship a reusable public-dataset bench. **Bench driver — \`scripts/bench.mjs\` + \`docs/guides/benchmark.md\`.** Vanilla-Node ESM driver that builds Memento, stages a memorybench fork at a pinned ref (or a local checkout via \`--memorybench-dir\`), spawns one \`bun run src/index.ts run -p memento -b \` per requested benchmark, and writes a single summary markdown to \`bench/.md\`. Defaults to LoCoMo + LongMemEval with \`sonnet-4.6\` pinned for judge + answering + distillation — the model class real Memento users have on the conversation side via Claude Code, Cursor, and Claude Desktop. Top-K=30, 180s indexing deadline. Per-phase concurrency flags pass through to memorybench (\`--concurrency-ingest=1\` is the safe knob against Anthropic's bursty Overloaded 529s and also lets the provider's per-session distillation cache hit when questions share sessions). The driver spawns the locally built CLI via \`process.execPath\` and asserts \`better-sqlite3\` loads under that exact Node before doing any expensive work, so a \`nvm + homebrew\` PATH cocktail can't crash the run with a confusing "MCP error -32000: Connection closed". \`--resume=\` picks up at the failed phase of the failed question for a crashed run; the runId is logged on a dedicated line and reprinted as a copy-pasteable command on any non-zero exit. \`--out\` anchors to the Memento repo root regardless of \`cwd\` so running from inside the fork checkout doesn't leak the output directory into the fork worktree. Not part of \`pnpm verify\` — needs network, judge API keys, and hours of wall-clock. The provider implementation lives in a fork of memorybench at veerps57/memorybench@add-memento-provider; the driver pins to that ref so reproduction is exact. **Write side — \`extract_memory\` contract clarity + distillation craft.** The MCP tool description on \`extract_memory\` states the candidate- shape difference from \`write_memory\` (flat \`kind\` enum, top-level \`rationale\`/\`language\`), the \`topic: value\n\nprose\` requirement for \`preference\`/\`decision\` kinds, the \`storedConfidence: 0.8\` async-default, and the receipt-not-failure semantics of \`mode: "async"\`. An inline example shows four kinds with the correct field placement, including a \`preference\` candidate with the required topic-line and a \`decision\` candidate with top-level \`rationale\`. \`TagSchema\` emits an actionable error message listing the allowed charset instead of a bare "Invalid". The skill (\`skills/memento/SKILL.md\`), persona-snippet guide (\`docs/guides/teach-your-assistant.md\`), and landing-page persona- snippet mirror (\`packages/landing/src/App.tsx\`) carry a "Distillation craft" section that frames the task as retrieval indexing for unknown future queries (not summarisation for a reader) and codifies five rules in priority order: preserve specific terms (proper nouns, identity qualifiers, named entities, places, the specific object of every action); emit a candidate for every dated event with the date resolved against the session anchor, never collapsing it to an untimed habit; capture precursor actions alongside outcomes ("researched X then chose Y" emits two candidates since future questions can target either step); don't squash enumerations into category labels; bias toward inclusion (the server dedups via embedding similarity, so over-including is cheap and under-including is permanent). A pre-emit self-check ("did every date, named entity, and verb-with-specific- object map to a candidate?") sits alongside the rules in each surface. **Read side — porter stemming for FTS5.** \`memories_fts\` is now built with \`tokenize='porter unicode61'\` instead of the default \`unicode61\`. The chain runs right-to-left: unicode61 splits + diacritic-folds first (so non-ASCII content still tokenises correctly), then porter stems the resulting ASCII tokens. "colleague", "colleagues", and "colleague's" share a stem and match each other; "bake" matches "baking" / "baked" / "bakes"; "research" matches "researched" / "researches"; "agency" matches "agencies". The \`retrieval.fts.tokenizer\` config key now defaults to \`porter\` and is documented as honoured by the FTS index (it was previously declared but ignored by the migration — dead-code tunability that this change makes real). Migration 0008 drops and rebuilds \`memories_fts\` with the new tokenizer, preserving stable rowids via the \`memories_fts_map\` table; the runner applies it on first server start after upgrade, so no operator action is required. Six new unit tests in \`0008_fts_porter_tokenizer.test.ts\` cover the acceptance criteria: stem variants match across plural/singular and verb-form pairs, pre-migration memories are re-indexed under the new tokenizer, the insert/update/delete triggers carry the new tokenizer through write-path operations, and non-ASCII content (German umlauts, French diacritics, Japanese katakana) survives the chain intact. The trade-off accepted is porter's known over-stems (organize/organic, universe/university). For Memento's dominant query distribution — assistants asking about durable user state in natural language — recall on stem variants is worth more than precision on these edge cases. Operators who need the older behaviour can author a follow-up migration; the config key documents the option. **Embedder perf — real batched feature-extraction.** \`@psraghuveer/memento-embedder-local\`'s \`embedBatch\` now uses transformers.js v3's array-input pipeline, which runs one forward pass for the whole batch instead of looping per text. Numerically identical to the single-call form (verified row-by-row against the same input). Measured ~1.8x speedup on a 3-input batch with \`bge-base-en-v1.5\` on CPU; the speedup grows with batch size because tokenisation and pipeline setup amortise across the batch. The loader contract now returns \`{ embed, embedBatch? }\` instead of a bare \`embed\` function; loaders that omit \`embedBatch\` fall back to the previous sequential behaviour, so test fixtures and bespoke implementations keep working unchanged. Seven new unit tests cover the fast path, the sequential fallback, empty-input short-circuit, runtime-row-count mismatch, per-row dimension validation, batched \`maxInputBytes\` truncation, and whole-batch timeout. The \`EmbeddingProvider.embedBatch\` surface in \`@psraghuveer/memento-core\` is unchanged and remains optional; existing call sites that go through \`embedBatchFallback\` automatically pick up the fast path. **Release note.** Single changeset bumps schema / core / embedder-local / memento minor — the FTS-tokenizer default flips, the \`extract_memory\` tool description shifts, and the embedder batches in user-observable ways. \`memento-landing\` is private (marketing site, not published to npm) and is added to the changesets \`ignore\` list so it can't accidentally land in any future release note. \`docs/reference/{cli,mcp-tools,config-keys}.md\` is regenerated to reflect the new tool description, default, and error message. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-14T1430-distillation-clarity.md | 14 + .changeset/config.json | 2 +- .gitignore | 6 + .markdownlint-cli2.jsonc | 13 +- docs/guides/benchmark.md | 130 ++++ docs/guides/teach-your-assistant.md | 106 +++- docs/reference/cli.md | 15 +- docs/reference/config-keys.md | 2 +- docs/reference/mcp-tools.md | 15 +- packages/core/src/commands/memory/extract.ts | 30 +- .../migrations/0008_fts_porter_tokenizer.ts | 142 +++++ packages/core/src/storage/migrations/index.ts | 2 + .../0008_fts_porter_tokenizer.test.ts | 249 ++++++++ packages/embedder-local/src/embedder.ts | 110 +++- packages/embedder-local/src/index.ts | 2 + packages/embedder-local/test/embedder.test.ts | 139 ++++- packages/landing/src/App.tsx | 37 +- packages/schema/src/config-keys.ts | 4 +- packages/schema/src/primitives.ts | 7 +- scripts/bench.mjs | 553 ++++++++++++++++++ skills/memento/SKILL.md | 48 +- 21 files changed, 1572 insertions(+), 54 deletions(-) create mode 100644 .changeset/2026-05-14T1430-distillation-clarity.md create mode 100644 docs/guides/benchmark.md create mode 100644 packages/core/src/storage/migrations/0008_fts_porter_tokenizer.ts create mode 100644 packages/core/test/storage/migrations/0008_fts_porter_tokenizer.test.ts create mode 100644 scripts/bench.mjs diff --git a/.changeset/2026-05-14T1430-distillation-clarity.md b/.changeset/2026-05-14T1430-distillation-clarity.md new file mode 100644 index 0000000..f633e27 --- /dev/null +++ b/.changeset/2026-05-14T1430-distillation-clarity.md @@ -0,0 +1,14 @@ +--- +"@psraghuveer/memento-schema": minor +"@psraghuveer/memento-core": minor +"@psraghuveer/memento-embedder-local": minor +"@psraghuveer/memento": minor +--- + +Make Memento more usable for AI-assisted memory work — clearer write-side contract, stronger read-side recall, and faster batched embeddings. + +**Write side — distillation contract clarity.** The MCP tool description on `extract_memory` flags the candidate-shape difference from `write_memory` (flat `kind` enum, top-level `rationale`/`language`), states the `topic: value\n\nprose` requirement for `preference`/`decision` kinds, and notes the `storedConfidence: 0.8` async-default. An inline example shows four kinds with the correct field placement — including a `preference` candidate that opens with the required topic-line and a `decision` candidate with top-level `rationale`. `TagSchema` emits an actionable error message listing the allowed charset instead of a bare "Invalid". The skill, persona-snippet guide, and landing-page persona-snippet mirror carry a "Distillation craft" section that frames the task as retrieval indexing (not summarisation) and codifies six rules: preserve specific terms (proper nouns, identity qualifiers, dates, named entities); capture facts about every named participant, not only the user (a friend the user mentions, a colleague, a co-speaker — facts they share about themselves AND the user's observations about them are both worth indexing, attributed to the right named person); emit a candidate for every dated event with the date resolved against the session anchor; capture precursor actions alongside outcomes ("researched X" AND "chose Y" as separate candidates, since future questions can target either); don't squash enumerations into category labels; bias toward inclusion (the server dedups). + +**Read side — porter stemming for FTS5.** `memories_fts` is now built with `tokenize='porter unicode61'` instead of the default `unicode61`. The chain has unicode61 split + diacritic-fold first, then porter stem the resulting tokens — so "colleague", "colleagues", and "colleague's" share a stem and match each other in keyword search, and "bake" matches "baking" / "baked" / "bakes". Non-ASCII content still tokenises correctly because unicode61 runs first. The `retrieval.fts.tokenizer` config key now defaults to `porter` and is documented as honoured by the FTS index (previously declared but ignored). Migration 0008 drops and rebuilds `memories_fts` with the new tokenizer, preserving stable rowids via the `memories_fts_map` table; the runner applies it on first server start after upgrade, so no operator action is required. Recall on natural-language queries — where the speaker's wording and the future question's wording differ in plural, verb form, or possessive — improves at the FTS layer instead of depending on vector search to rescue every morphological miss. + +**Embedder perf — real batched feature-extraction.** `@psraghuveer/memento-embedder-local`'s `embedBatch` now uses transformers.js v3's array-input pipeline, which runs one forward pass for the whole batch instead of looping per text. Numerically identical to the single-call form (verified row-by-row against the same input). Measured ~1.8× speedup on a 3-input batch with `bge-base-en-v1.5` on CPU; the speedup grows with batch size because tokenisation and pipeline setup amortise across the batch. The loader contract now returns `{ embed, embedBatch? }` instead of a bare `embed` function; loaders that omit `embedBatch` fall back to the previous sequential behaviour, so test fixtures and bespoke implementations keep working unchanged. The `EmbeddingProvider.embedBatch` surface in `@psraghuveer/memento-core` is unchanged and remains optional; existing call sites that go through `embedBatchFallback` automatically pick up the fast path. diff --git a/.changeset/config.json b/.changeset/config.json index fce1c26..10563f5 100644 --- a/.changeset/config.json +++ b/.changeset/config.json @@ -7,5 +7,5 @@ "access": "public", "baseBranch": "main", "updateInternalDependencies": "patch", - "ignore": [] + "ignore": ["@psraghuveer/memento-landing"] } diff --git a/.gitignore b/.gitignore index 3d21a41..ee9c525 100644 --- a/.gitignore +++ b/.gitignore @@ -59,6 +59,12 @@ docs/reference/*.generated.md .stryker-tmp/ reports/ +# Local artifacts from scripts/{bench,retrieval-eval,stress-test}.mjs +bench/ +bench-output-*/ +eval-report-*.md +stress-report-*.md + # Skills bundle staged by the CLI build (source of truth lives at # /skills/; packages/cli/skills/ is a per-build copy for the npm tarball). packages/cli/skills/ diff --git a/.markdownlint-cli2.jsonc b/.markdownlint-cli2.jsonc index 88ae04b..2b244b4 100644 --- a/.markdownlint-cli2.jsonc +++ b/.markdownlint-cli2.jsonc @@ -49,5 +49,16 @@ "MD060": false }, "globs": ["**/*.md"], - "ignores": ["**/node_modules/**", "**/CHANGELOG.md", "**/dist/**", "**/coverage/**"] + "ignores": [ + "**/node_modules/**", + "**/CHANGELOG.md", + "**/dist/**", + "**/coverage/**", + // Local artifacts from scripts/*.mjs (also in .gitignore). Not + // committed, not authored by us — just bench/eval/stress outputs. + "bench/**", + "bench-output-*/**", + "eval-report-*.md", + "stress-report-*.md" + ] } diff --git a/docs/guides/benchmark.md b/docs/guides/benchmark.md new file mode 100644 index 0000000..8babbd1 --- /dev/null +++ b/docs/guides/benchmark.md @@ -0,0 +1,130 @@ +# End-to-end benchmark (memorybench) + +Memento ships a driver at [`scripts/bench.mjs`](../../scripts/bench.mjs) that runs the public [`supermemoryai/memorybench`](https://github.com/supermemoryai/memorybench) harness end to end against a locally built Memento. Datasets, retrieval workflow, answering model, and LLM judge all come from memorybench — Memento contributes only the `MementoProvider` implementation under `src/providers/memento/`, proposed to memorybench upstream as a PR. Until that PR merges, point `--memorybench-dir` at a local checkout of the PR branch. + +This is one of three measurement scripts; each answers a different question: + +- [`stress-test.md`](stress-test.md) — *is the engine fast, correct, stable at scale?* (throughput, latency, contract probes) +- [`retrieval-eval.md`](retrieval-eval.md) — *is the ranker returning the right memories?* (Recall, MRR, nDCG over a small labeled needle set) +- `bench.mjs` (this guide) — *how does Memento answer real long-conversation questions?* (LoCoMo + LongMemEval, judged by an LLM) + +`bench.mjs` is not part of `pnpm verify`. It needs network access, judge API keys, and hours of wall-clock; CI gates must pass offline. + +## How the provider uses Memento + +Memento stores **distilled assertions, not transcripts** — the calling AI assistant uses its own LLM to decide what's worth remembering, then hands those candidates to Memento's `extract_memory` MCP tool. The bench provider mirrors that flow inside the harness: for each `UnifiedSession` produced by memorybench, the provider calls the configured LLM (defaults to the bench's answering model) to produce structured `{kind, content}` candidates and writes them via `extract_memory`. Memento embeds, scrubs, dedups, and persists. The provider does no raw-message ingestion — the LLM-distillation step is what represents real Memento usage faithfully. + +## Running it + +Prereqs: [Bun](https://bun.sh) (`bun --version` >= 1.0), Node 22+, and the `ANTHROPIC_API_KEY` env var (the default judge / answering / distillation model is `sonnet-4.6`, because that's what the bulk of MCP-using clients run on the conversation side). For other model families, set the corresponding key (`GOOGLE_API_KEY` / `OPENAI_API_KEY`). The script builds Memento itself. + +If your shell sets `ANTHROPIC_BASE_URL` (some agent runtimes do), ensure it ends in `/v1` — the AI SDK appends `/messages` to whatever you give it, so a bare `https://api.anthropic.com` produces a 404. Either unset the variable or point it at `https://api.anthropic.com/v1`. + +```bash +export ANTHROPIC_API_KEY=... +node scripts/bench.mjs # LoCoMo + LongMemEval, defaults (sonnet-4.6) +node scripts/bench.mjs --benchmark=locomo --limit=5 # 5-question smoke test +node scripts/bench.mjs --judge=gemini-2.5-pro # cross-family judge (needs GOOGLE_API_KEY) +node scripts/bench.mjs --concurrency-ingest=1 # serialize ingest to stay under Anthropic rate limits +``` + +A summary markdown is written to `bench/.md` (one file per invocation, the directory is git-ignored); full per-question JSON reports live under the staged memorybench checkout at `data/runs//report.json`. If a run crashes mid-flight (Anthropic `Overloaded`, network drop, Ctrl-C, OOM) the orchestrator persists a checkpoint after every phase boundary — re-run with `node scripts/bench.mjs --resume=` to pick up at the failed phase of the failed question, skipping all completed work. The runId is logged on a dedicated line in the bench log and printed again as a copy-pasteable command if the run exits non-zero. + +### Flags + +| Flag | Default | Notes | +|---|---|---| +| `--benchmark=` | `locomo,longmemeval` | Which benchmarks to run. Memorybench also has `convomem` (deferred). | +| `--judge=` | `sonnet-4.6` | Model that scores `correct`/`incorrect`. Use a different family (e.g. `gemini-2.5-pro`) for cross-family-independence. | +| `--answering-model=` | `sonnet-4.6` | Model that generates the hypothesis and (by default) the per-session distillation. | +| `--search-limit=` | `30` | top-K returned by the provider. Passed through as `MEMENTO_BENCH_SEARCH_LIMIT`. | +| `--limit=` | *(none)* | Cap questions per benchmark. Use for smoke tests. | +| `--memorybench-ref=` | `add-memento-provider` | Git ref of the fork to clone. | +| `--memorybench-dir=` | *(none)* | Use a local fork checkout instead of cloning. Skips network. | +| `--memorybench-repo=` | *(see DEFAULTS)* | Override the fork URL. | +| `--concurrency-ingest=` | *(memorybench default)* | Lower to tame the embedder during ingest. `--concurrency-{indexing,search,answer,evaluate}` also accepted. | +| `--out=` | `/bench` | Directory the summary `.md` is written into. | +| `--resume=` | *(none)* | Resume a crashed run by its runId (`memento--`). Reuses the checkpoint at `/data/runs//checkpoint.json` and skips all completed phases. | + +### Env vars + +| Var | When | Purpose | +|---|---|---| +| `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` / `GOOGLE_API_KEY` | Required for the chosen judge / answering family (the default needs `ANTHROPIC_API_KEY`). | Memorybench's judge + answering layer, and the provider's distillation step. | +| `ANTHROPIC_BASE_URL` | Optional override; must end in `/v1` if set. | The AI SDK appends `/messages`, so a bare domain produces 404s. | +| `MEMENTO_DISTILL_MODEL` | Optional | Override which model does session-level distillation (default: the answering model). | +| `MEMORYBENCH_REPO` / `MEMORYBENCH_REF` | Optional | Override default fork URL and ref. | +| `MEMORYBENCH_DIR` | Optional | Same as `--memorybench-dir`. | +| `MEMENTO_BENCH_KEEP_WORKDIR` | `1` to keep | Skip cleanup of the cloned fork dir at the end of a run. | + +## What's measured + +Memorybench reports per benchmark: + +| Metric | Meaning | +|---|---| +| `accuracy` | `correctCount / totalQuestions` (judge labeling). | +| `MemScore` | Composite `${qualityPct}% / ${avgLatencyMs}ms / ${avgContextTokens}tok` — three numbers, never collapsed into one. | +| `latency.{ingest,indexing,search,answer,evaluate}` | Per-phase percentiles (p50/p95/p99). The search number is what gets surfaced in MemScore. | +| `tokens.avgContextTokens` | Average tokens of retrieved context the answering model received per question. | +| `byQuestionType` | Accuracy + latency broken out by the dataset's native question-type taxonomy. | + +The summary file produced by `bench.mjs` extracts the headline + per-type table into one place. The raw `report.json` carries the full `evaluations` array if you need to trace a single question. + +## How Memento exercises memorybench + +`MementoProvider` implements memorybench's five-method `Provider` interface: + +1. **`initialize`** spawns `memento serve --db ` over stdio, asserts the required MCP tools (`extract_memory`, `search_memory`, `forget_many_memories`) are present, and runs one warmup write so the bge-base-en-v1.5 embedder model is loaded before the first benchmark question. +2. **`ingest`** distills each `UnifiedSession` through the configured LLM into `{kind, content}` candidates, then hands the batch to Memento's `extract_memory`. Memories land under `scope = {type: 'workspace', path: '/memorybench/'}` (workspace scope is the isolation primitive because Memento's `session.id` requires a ULID, while memorybench's `containerTag` is an arbitrary string). Each memory carries `benchmark:memorybench`, `session:`, and (when present) `session-date:` tags. +3. **`awaitIndexing`** polls `search_memory` on the question's scope until every result has `embeddingStatus !== 'pending'` (or a configurable 180s deadline expires). +4. **`search`** runs `search_memory` with the question's scope filter, `projection: 'full'`, and `limit: options.limit ?? 30`. +5. **`clear`** filters `forget_many_memories` by scope. The orchestrator doesn't call this in normal runs — containers persist for the run's lifetime — but it's there for partial-rerun recovery. + +The provider supplies a custom `answerPrompt` (`src/providers/memento/prompts.ts` in the fork) that presents each retrieved memory with its score, kind, and session date — the latter being the temporal anchor the LLM used during distillation. Default JSON-stringified context would hide this structure. + +## Methodology and caveats + +**One server per benchmark run.** Spawning a fresh `memento serve` per question (~2000 spawns across LoCoMo + LongMemEval) is expensive without buying any isolation that workspace scope doesn't already give. The single-DB approach mirrors how `scripts/retrieval-eval.mjs` plants thousands of needles in one in-memory SQLite. + +**Container isolation via workspace scope.** Each question's memories land under `{type: 'workspace', path: '/memorybench/'}`. Memento's architectural rule that scope is immutable per memory makes this isolation contract reliable. Searches always pass the per-question scope; `clear` filters by it. + +**Datasets are fetched at runtime.** LoCoMo from `raw.githubusercontent.com/snap-research/locomo`, LongMemEval from HuggingFace `xiaowu0162/longmemeval-cleaned`. The first run of each benchmark downloads and caches the JSON under the staged memorybench checkout. A move or removal of the upstream dataset breaks reproducibility — see the [risks](#risks) section. + +**Judge model = answering model = distillation model.** The baseline pins all three to `sonnet-4.6` — the model class that actually shows up on the conversation side in real Memento usage (Claude Code, Cursor, Claude Desktop dominate the MCP-using-client population, and `extract_memory` is called from that same assistant). Using the same model for all three avoids one layer of cross-family bias but does collapse to a single family's strengths and weaknesses; the standard robustness move is to swap the judge to a different family (e.g. `--judge=gemini-2.5-pro` or `--judge=gpt-4o`) and report agreement rates. Override via `--judge` / `--answering-model` / `MEMENTO_DISTILL_MODEL`. + +**Container-tag stability on resume.** Memorybench's orchestrator tracks `completedSessions` per question, so a resumed run skips already-ingested sessions; the provider doesn't need to dedup on the write path. + +## How to reproduce + +```bash +git clone https://github.com/veerps57/memento && cd memento +pnpm install +# Build is run automatically by bench.mjs; can be done up front for offline reuse: +# pnpm -F @psraghuveer/memento-schema -F @psraghuveer/memento-core -F @psraghuveer/memento-server -F @psraghuveer/memento -F @psraghuveer/memento-embedder-local build + +export ANTHROPIC_API_KEY=... +node scripts/bench.mjs --memorybench-ref= +``` + +The summary's reproducibility footer captures the exact Memento + memorybench commits used so a re-run against the same SHAs is byte-comparable up to model nondeterminism and dataset-upstream drift. + +## Baseline results + +*(Populated by the follow-up PR after the first full LoCoMo + LongMemEval run lands. Until then, this section reads "pending".)* + +## Risks + +- **Dataset upstream availability.** LoCoMo and LongMemEval are fetched on first use; if either repository moves or is taken down mid-run, the affected benchmark fails. Mitigation: pin a fork commit that vendors the datasets, or re-run from a cached checkout (`--memorybench-dir`). +- **Embedder model cold-download.** bge-base-en-v1.5 is downloaded from HuggingFace on the first Memento write of a fresh install. The provider's warmup step pulls this download into `initialize()` so the first benchmark question doesn't pay the cost — but the warmup itself can fail under flaky network. +- **Distillation LLM rate limits.** Per-session distillation adds one LLM call per session × ~20 sessions per question × ~20 questions per benchmark = a few hundred extra calls. With the default `sonnet-4.6`, Anthropic returns `Overloaded` (HTTP 529) under bursty parallel load; memorybench retries three times then fails the question. `--concurrency-ingest=1` serializes the ingest path and also lets the provider's per-session distillation cache hit (questions that share sessions skip the redundant LLM call) — it's the safest knob for any Anthropic-family distill. +- **Judge API rate limits.** Anthropic / OpenAI / Google RPM caps can stall the `evaluate` phase. Use `--concurrency-evaluate=10` (or lower) to soften the rate at the cost of wall-clock. +- **`ANTHROPIC_BASE_URL` shadowing.** Some agent runtimes set `ANTHROPIC_BASE_URL=https://api.anthropic.com` (no `/v1`). The `@ai-sdk/anthropic` client appends `/messages` to that, producing a 404 that surfaces as a generic `Not Found` ingest failure. Either unset the variable or point it at `https://api.anthropic.com/v1` before running the bench. + +## Out of scope / future + +- ConvoMem as a third benchmark. +- Multi-judge runs (Sonnet + GPT-4o + Gemini agreement rates) for robustness against single-judge bias. +- GitHub Actions nightly bench against `main` with auto-issue on regression. +- Dashboard widget showing the latest baseline. +- Pinning the driver to `npx @psraghuveer/memento@` after a stable release so external contributors don't need a pnpm monorepo to reproduce. diff --git a/docs/guides/teach-your-assistant.md b/docs/guides/teach-your-assistant.md index 385e86b..474854f 100644 --- a/docs/guides/teach-your-assistant.md +++ b/docs/guides/teach-your-assistant.md @@ -58,6 +58,75 @@ editing, the error message you just saw, what the user typed five minutes ago). Memento is not a chat log. ``` +### Preserve specific terms — don't paraphrase qualifiers away + +"Distilled, not transcript" doesn't mean "summarised into generic categories." You are not summarising for a reader — you are producing retrieval candidates for unknown future queries. The future question uses the specific terms the speaker used; an assistant that drops them blocks recall. + +```text +You are not summarising the conversation. You are producing +retrieval candidates for unknown future queries — the future +question may ask about any specific date, named entity, proper +noun, action, or object that came up. Index every concrete +reference; don't capture the gist. + +Preserve specific words. Use the speakers' exact terms for proper +nouns, named entities, identity qualifiers, places, dates, and the +specific object of any action. + +- "researched adoption agencies" → "Raghu researched adoption + agencies", not "Raghu researched career options". +- "transgender woman" → "Raghu is a transgender woman", not + "Raghu identifies as a woman". +- "the Wonderland Trail" → name it, not "a hiking trail". +- "May 7" → resolve to an absolute date and emit it, not + "in spring". + +Capture facts about every named participant, not only the user. +A conversation may mention or include other people — a friend the +user talks about, a colleague, a family member, or a co-speaker in +a shared session. Facts they share about themselves AND the user's +specific observations about them are both worth indexing, each +attributed to the right named person. + +- "My friend Alex is moving to Berlin next month for a SAP job" + → emit "Alex is moving to Berlin in " AND "Alex has a + new job at SAP" (attributed to Alex, not collapsed to Raghu). +- In a meeting transcript where Sarah said "I have three kids" + and Raghu said "I work from home" → both facts get captured, + each attributed to its speaker. Don't bias toward the first + speaker or the apparent "user" persona. + +Emit a candidate for every dated event. If the user mentions an +event with a resolvable date — absolute ("May 7") or relative +("yesterday", "last Tuesday", "two weeks ago") — emit one +candidate with the absolute date in the content. Resolve relative +dates against the current date. Do NOT generalise dated events +into untimed habits ("the user attends conferences" loses the +date). When in doubt, emit both a dated candidate AND a general +one. The future "when did X happen?" question can only be +answered by a memory that names the date. + +Capture precursor actions alongside outcomes. When the user +describes a sequence ("researched X then chose Y", "tried A and +settled on B"), emit both: a candidate for the precursor (the +research, the try) AND a candidate for the outcome. Future +questions can target either step — "what did Raghu research?" +and "what did Raghu choose?" have different answers. + +Don't squash enumerations. If the user lists four activities, +emit four facts (or one fact that names all four explicitly) — +never one fact that says "outdoor activities and crafts". + +Bias toward inclusion. The server dedups via embedding +similarity; over-including is cheap, under-including drops the +fact entirely. + +Before finalising a write_memory or extract_memory call, scan the +conversation once more: every date or time-relative word, every +proper noun, every action verb with a specific object — does each +map to at least one candidate? If a reference is missing, add it. +``` + ### Use a `topic: value` first line for preferences and decisions Conflict detection on `preference` and `decision` memories parses the *first line* of `content` as `topic: value` (or `topic = value`). Two memories with the same topic and different values are flagged for triage. Free-prose content without a parseable first line never conflicts — so an assistant that writes "Raghu prefers bun" today and "Raghu uses npm" tomorrow leaves both rows active with no surfaced contradiction. @@ -185,10 +254,45 @@ memory; treat chat as ephemeral. skipped:[], superseded:[], mode:"async", batchId, hint, status: "accepted"}` — that is the receipt, not a failure. Writes land as memories within seconds; do not retry. +- `extract_memory`'s candidate shape is **flat** — `kind` is a + string (`"kind":"fact"`) and `rationale` / `language` are + top-level fields. This differs from `write_memory`, which uses + a discriminated-union `kind` object (`"kind":{"type":"fact"}`) + with those fields nested inside. Copying the write_memory shape + into an extract candidate produces `INVALID_INPUT` and rejects + the whole batch. - For preferences and decisions, start `content` with a single `topic: value` line followed by prose. Conflict detection parses that line; without it, contradictory preferences - silently coexist. + silently coexist. The same rule applies to both `write_memory` + and `extract_memory`. +- Distillation is **retrieval indexing**, not summarisation. The + future question may ask about any specific date, named entity, + proper noun, action, or object that came up — index every + concrete reference, don't capture the gist. +- Preserve specific terms when distilling. Use the speakers' exact + words for proper nouns, identity qualifiers, places, dates, and + the object of any action. "Raghu researched adoption agencies", + not "researched career options". "Transgender woman", not + "woman". Don't squash enumerations; emit each item or list them + explicitly. When in doubt, include — the server dedups. +- Capture facts about every named participant, not only the user. + If the user mentions someone ("my friend Alex is moving to + Berlin for a SAP job"), emit memories attributed to that named + person (Alex is moving to Berlin; Alex has a new job at SAP), + not collapsed onto the user. The future question may ask about + anyone named in the conversation. +- Emit a candidate for every dated event. If the user mentions an + event with an absolute date ("May 7") or a relative one + ("yesterday", "last Tuesday"), resolve to an absolute date and + emit it in the content. Don't fold dated events into untimed + habits — the future "when did X happen?" query can only be + answered by a memory that names the date. +- Capture precursor actions alongside outcomes. When the user + describes a sequence ("researched X then chose Y", "tried A and + settled on B"), emit both — a candidate for the precursor and a + candidate for the outcome. Future questions can target either + step; the outcome never erases the precursor. - Use the user's preferred name from `info_system.user.preferredName` when authoring memory content; fall back to "The user" when that field is null. diff --git a/docs/reference/cli.md b/docs/reference/cli.md index 10e4db8..8a5a990 100644 --- a/docs/reference/cli.md +++ b/docs/reference/cli.md @@ -285,14 +285,25 @@ Optional `types`, `since`, `until`, and `limit` narrow the result set; `since`/` Batch-extract candidate memories from a conversation. The server handles dedup, scrubbing, and writing. The assistant's job is reduced to dumping "what seemed worth remembering." +**Candidate shape note** — this command's `kind` field is a flat string (`"kind": "fact"`), and `rationale` / `language` are top-level fields. This differs from `memory.write`, where `kind` is a discriminated-union object and those fields nest inside it. Copying the write_memory shape will fail validation with `INVALID_INPUT`. + +**Topic-line gotcha** — for `preference` and `decision` candidates, the `content` MUST start with a `topic: value` line followed by a blank line and prose. The conflict detector parses that first line; without it, contradictory preferences silently coexist. The handler returns `INVALID_INPUT` for offending candidates — the whole batch is rejected, not just the bad items. + Dedup runs at two scopes: (1) **in-batch** — byte-identical candidates within the same call collapse to a single memory (kind-aware fingerprint); (2) **cross-batch** — embeddings are compared against existing active memories via the configured similarity thresholds (≥`extraction.dedup.identicalThreshold` skips, between that and `extraction.dedup.threshold` supersedes, below writes new). When in doubt, include the candidate. +**Storage defaults** — extracted memories are written at `storedConfidence: 0.8` (lower than `memory.write`'s 1.0) so they decay faster and get pruned if never confirmed. This biases toward precision: tentative captures don't crowd out user-stated facts. + The response carries a `mode` field. When `mode: "sync"`, the `written`, `skipped`, and `superseded` arrays are authoritative and you can report them directly. When `mode: "async"` (the default per `extraction.processing` config), those arrays are intentionally empty — the server returned a receipt and is processing in background. The accompanying `hint` field explains what to expect; do not retry. Writes land as memories within ~1–5 seconds and can be confirmed with `list_memories` or `search_memory` if needed. -Example: +Example (note the flat kind, the topic-line on the preference, and top-level rationale on the decision): ```json -{"candidates":[{"kind":"preference","content":"User prefers dark mode in all editors"},{"kind":"fact","content":"The production database is PostgreSQL 15"}]} +{"candidates":[ + {"kind":"preference","content":"editor-theme: dark\n\nUser prefers dark mode in all editors."}, + {"kind":"fact","content":"The production database is PostgreSQL 15."}, + {"kind":"decision","content":"storage-engine: SQLite\n\nChosen for the local-first story; FTS5 built in.","rationale":"Single-file, no daemon, prebuilt for every platform."}, + {"kind":"snippet","content":"memento read ","language":"shell"} +]} ``` - **Side-effect:** `write` — Mutates state and emits an audit-log event. diff --git a/docs/reference/config-keys.md b/docs/reference/config-keys.md index e167fbd..5c585b6 100644 --- a/docs/reference/config-keys.md +++ b/docs/reference/config-keys.md @@ -84,7 +84,7 @@ Total: 98 keys. | Key | Default | Mutable | Description | | --- | --- | --- | --- | -| `retrieval.fts.tokenizer` | `"unicode61"` | no | FTS5 tokenizer for `memories_fts`. Pinned at server start because changing it requires a reindex. | +| `retrieval.fts.tokenizer` | `"porter"` | no | FTS5 tokenizer for `memories_fts`. `porter` stems tokens (so "colleagues" and "colleague's" share a stem and match each other in search) and is chained onto `unicode61` so non-ASCII content still tokenises correctly. `unicode61` alone is exact-token matching with no stemming — choose this only when proper-noun precision matters more than prose recall. Migration 0008 sets the FTS index up with porter; switching to `unicode61` requires a manual `drop table memories_fts` and reindex, which is why the key is immutable. | | `retrieval.vector.enabled` | `true` | yes | When true, retrieval unions FTS candidates with cosine-similarity matches over `embedding`. Requires an `EmbeddingProvider` to be wired into the host; `memory.search` returns CONFIG_ERROR when the flag is on and no provider is present. | | `retrieval.vector.backend` | `"auto"` | no | Vector search backend selector. `brute-force` is the shipping backend; `auto` resolves to it. | | `retrieval.ranker.strategy` | `"linear"` | yes | Ranker strategy. `linear` (default) is the weighted-sum ranker that the shipped `retrieval.ranker.weights.*` defaults are tuned for: FTS and cosine arms are batch-max-normalised to `[0, 1]` and composed with the four baseline arms (confidence, recency, scope, pinned) which are already `[0, 1]`. `rrf` (Reciprocal Rank Fusion) replaces the FTS and cosine arms with rank-based contributions `weight_a / (k + rank_a)` — values at `k=60` are around `0.016` at rank 1, three orders of magnitude smaller than `linear`. Flipping to `rrf` at the shipped weight defaults will heavily suppress the FTS and vector arms relative to the baselines; rescale `retrieval.ranker.weights.fts` / `retrieval.ranker.weights.vector` by roughly `(k + 1)` when switching. Tune `k` via `retrieval.ranker.rrf.k`. | diff --git a/docs/reference/mcp-tools.md b/docs/reference/mcp-tools.md index 5c13b6e..19703e1 100644 --- a/docs/reference/mcp-tools.md +++ b/docs/reference/mcp-tools.md @@ -194,14 +194,25 @@ Registry name: `memory.extract` — CLI: `memento memory extract` Batch-extract candidate memories from a conversation. The server handles dedup, scrubbing, and writing. The assistant's job is reduced to dumping "what seemed worth remembering." +**Candidate shape note** — this command's `kind` field is a flat string (`"kind": "fact"`), and `rationale` / `language` are top-level fields. This differs from `memory.write`, where `kind` is a discriminated-union object and those fields nest inside it. Copying the write_memory shape will fail validation with `INVALID_INPUT`. + +**Topic-line gotcha** — for `preference` and `decision` candidates, the `content` MUST start with a `topic: value` line followed by a blank line and prose. The conflict detector parses that first line; without it, contradictory preferences silently coexist. The handler returns `INVALID_INPUT` for offending candidates — the whole batch is rejected, not just the bad items. + Dedup runs at two scopes: (1) **in-batch** — byte-identical candidates within the same call collapse to a single memory (kind-aware fingerprint); (2) **cross-batch** — embeddings are compared against existing active memories via the configured similarity thresholds (≥`extraction.dedup.identicalThreshold` skips, between that and `extraction.dedup.threshold` supersedes, below writes new). When in doubt, include the candidate. +**Storage defaults** — extracted memories are written at `storedConfidence: 0.8` (lower than `memory.write`'s 1.0) so they decay faster and get pruned if never confirmed. This biases toward precision: tentative captures don't crowd out user-stated facts. + The response carries a `mode` field. When `mode: "sync"`, the `written`, `skipped`, and `superseded` arrays are authoritative and you can report them directly. When `mode: "async"` (the default per `extraction.processing` config), those arrays are intentionally empty — the server returned a receipt and is processing in background. The accompanying `hint` field explains what to expect; do not retry. Writes land as memories within ~1–5 seconds and can be confirmed with `list_memories` or `search_memory` if needed. -Example: +Example (note the flat kind, the topic-line on the preference, and top-level rationale on the decision): ```json -{"candidates":[{"kind":"preference","content":"User prefers dark mode in all editors"},{"kind":"fact","content":"The production database is PostgreSQL 15"}]} +{"candidates":[ + {"kind":"preference","content":"editor-theme: dark\n\nUser prefers dark mode in all editors."}, + {"kind":"fact","content":"The production database is PostgreSQL 15."}, + {"kind":"decision","content":"storage-engine: SQLite\n\nChosen for the local-first story; FTS5 built in.","rationale":"Single-file, no daemon, prebuilt for every platform."}, + {"kind":"snippet","content":"memento read ","language":"shell"} +]} ``` - **Side-effect:** `write` — Mutates state and emits an audit-log event. diff --git a/packages/core/src/commands/memory/extract.ts b/packages/core/src/commands/memory/extract.ts index 808d126..9d2c3bc 100644 --- a/packages/core/src/commands/memory/extract.ts +++ b/packages/core/src/commands/memory/extract.ts @@ -46,18 +46,36 @@ const DRY_RUN_PLACEHOLDER_ID = '00000000000000000000000000' as unknown as Memory const ExtractionCandidateSchema = z .object({ - kind: z.enum(MEMORY_KIND_TYPES).describe('Memory kind for this candidate.'), - content: z.string().min(1).describe('The memory content to extract.'), - tags: z.array(z.string()).optional().describe('Optional tags for this candidate.'), + kind: z + .enum(MEMORY_KIND_TYPES) + .describe( + 'Memory kind: one of "fact", "preference", "decision", "todo", "snippet" — a plain string enum. NOTE: this is **flat** here (e.g. `"kind": "preference"`), unlike `memory.write` where the same field is a discriminated-union object (`"kind": {"type": "preference"}`). Reusing the write_memory shape will fail validation.', + ), + content: z + .string() + .min(1) + .describe( + 'The memory content to extract. For `preference` and `decision` kinds, the first line MUST be `topic: value` (or `topic = value`) followed by a blank line and prose — this is what the conflict detector parses; without it, contradictory preferences silently coexist. Example: `"node-package-manager: pnpm\\n\\nUser prefers pnpm over npm."`. `fact` / `todo` / `snippet` kinds use different conflict heuristics and don\'t require this format.', + ), + tags: z + .array(z.string()) + .optional() + .describe( + 'Optional tags for this candidate. Each tag is normalised (trimmed + lowercased) and validated against `[a-z0-9._:/-]` — spaces, commas, and uppercase are rejected. Pre-process human-prose values (replace spaces/commas with `-`).', + ), summary: z.string().nullable().optional().describe('Optional one-line summary.'), rationale: z .string() .optional() - .describe('Rationale for the decision (recommended for decision kind).'), + .describe( + 'Rationale for the decision — **top-level field, used only with `kind: "decision"`**. NOTE: this differs from `memory.write`, where `rationale` lives inside `kind: {type: "decision", rationale: "..."}`. In extract\'s flat candidate shape, rationale sits beside kind.', + ), language: z .string() .optional() - .describe('Programming language (recommended for snippet kind).'), + .describe( + 'Programming language hint — **top-level field, used only with `kind: "snippet"`** (e.g. "typescript", "shell"). Like `rationale`, this differs from `memory.write` where `language` lives inside `kind: {type: "snippet", language: "..."}`.', + ), }) .strict(); @@ -156,7 +174,7 @@ export function createMemoryExtractCommand( outputSchema: MemoryExtractOutputSchema, metadata: { description: - 'Batch-extract candidate memories from a conversation. The server handles dedup, scrubbing, and writing. The assistant\'s job is reduced to dumping "what seemed worth remembering."\n\nDedup runs at two scopes: (1) **in-batch** — byte-identical candidates within the same call collapse to a single memory (kind-aware fingerprint); (2) **cross-batch** — embeddings are compared against existing active memories via the configured similarity thresholds (≥`extraction.dedup.identicalThreshold` skips, between that and `extraction.dedup.threshold` supersedes, below writes new). When in doubt, include the candidate.\n\nThe response carries a `mode` field. When `mode: "sync"`, the `written`, `skipped`, and `superseded` arrays are authoritative and you can report them directly. When `mode: "async"` (the default per `extraction.processing` config), those arrays are intentionally empty — the server returned a receipt and is processing in background. The accompanying `hint` field explains what to expect; do not retry. Writes land as memories within ~1–5 seconds and can be confirmed with `list_memories` or `search_memory` if needed.\n\nExample:\n\n```json\n{"candidates":[{"kind":"preference","content":"User prefers dark mode in all editors"},{"kind":"fact","content":"The production database is PostgreSQL 15"}]}\n```', + 'Batch-extract candidate memories from a conversation. The server handles dedup, scrubbing, and writing. The assistant\'s job is reduced to dumping "what seemed worth remembering."\n\n**Candidate shape note** — this command\'s `kind` field is a flat string (`"kind": "fact"`), and `rationale` / `language` are top-level fields. This differs from `memory.write`, where `kind` is a discriminated-union object and those fields nest inside it. Copying the write_memory shape will fail validation with `INVALID_INPUT`.\n\n**Topic-line gotcha** — for `preference` and `decision` candidates, the `content` MUST start with a `topic: value` line followed by a blank line and prose. The conflict detector parses that first line; without it, contradictory preferences silently coexist. The handler returns `INVALID_INPUT` for offending candidates — the whole batch is rejected, not just the bad items.\n\nDedup runs at two scopes: (1) **in-batch** — byte-identical candidates within the same call collapse to a single memory (kind-aware fingerprint); (2) **cross-batch** — embeddings are compared against existing active memories via the configured similarity thresholds (≥`extraction.dedup.identicalThreshold` skips, between that and `extraction.dedup.threshold` supersedes, below writes new). When in doubt, include the candidate.\n\n**Storage defaults** — extracted memories are written at `storedConfidence: 0.8` (lower than `memory.write`\'s 1.0) so they decay faster and get pruned if never confirmed. This biases toward precision: tentative captures don\'t crowd out user-stated facts.\n\nThe response carries a `mode` field. When `mode: "sync"`, the `written`, `skipped`, and `superseded` arrays are authoritative and you can report them directly. When `mode: "async"` (the default per `extraction.processing` config), those arrays are intentionally empty — the server returned a receipt and is processing in background. The accompanying `hint` field explains what to expect; do not retry. Writes land as memories within ~1–5 seconds and can be confirmed with `list_memories` or `search_memory` if needed.\n\nExample (note the flat kind, the topic-line on the preference, and top-level rationale on the decision):\n\n```json\n{"candidates":[\n {"kind":"preference","content":"editor-theme: dark\\n\\nUser prefers dark mode in all editors."},\n {"kind":"fact","content":"The production database is PostgreSQL 15."},\n {"kind":"decision","content":"storage-engine: SQLite\\n\\nChosen for the local-first story; FTS5 built in.","rationale":"Single-file, no daemon, prebuilt for every platform."},\n {"kind":"snippet","content":"memento read ","language":"shell"}\n]}\n```', mcpName: 'extract_memory', }, handler: async (input, ctx) => { diff --git a/packages/core/src/storage/migrations/0008_fts_porter_tokenizer.ts b/packages/core/src/storage/migrations/0008_fts_porter_tokenizer.ts new file mode 100644 index 0000000..4520fc6 --- /dev/null +++ b/packages/core/src/storage/migrations/0008_fts_porter_tokenizer.ts @@ -0,0 +1,142 @@ +// Migration 0008: rebuild `memories_fts` with porter+unicode61 tokenization. +// +// The FTS5 default tokenizer (`unicode61`) does no stemming. It treats +// `colleague`, `colleagues`, `colleague's`, and `colleagueship` as four +// unrelated tokens. For a prose memory layer this is the wrong trade: +// natural-language queries miss morphologically-similar matches that +// vector search rescues only partially, and "preserve the speaker's +// exact words" guidance compounds the problem because the speaker's +// words and the future query's words are rarely the same surface form. +// +// SQLite's `porter` tokenizer chains onto `unicode61`: unicode61 splits +// + diacritic-folds first, then porter reduces each token to a Porter +// stem. So `colleague`, `colleagues`, `colleague's` all index as the +// same stem, and a query for any of them matches all of them. +// +// FTS5 virtual tables cannot have their tokenizer changed in place — +// the only path is drop + rebuild. This migration mirrors 0005's +// drop-rebuild-repopulate-retrigger shape, but with the new tokenizer +// declaration. The `memories_fts_map` table is preserved so rowids +// stay stable across the rebuild. +// +// Trade-off accepted: porter occasionally over-stems +// (`organize`/`organic`, `universe`/`university`). For Memento's +// dominant query distribution — assistants asking about durable user +// state in natural language — recall on stem variants is worth more +// than precision on these edge cases. Operators who need the older +// behaviour can set `retrieval.fts.tokenizer` to `'unicode61'` in +// config (the key is the configurable knob this migration honours). +// +// Forward-only. No `down`. + +import { sql } from 'kysely'; +import type { Migration } from '../migrate.js'; + +export const migration0008FtsPorterTokenizer: Migration = { + name: '0008_fts_porter_tokenizer', + async up(db) { + // 1. Drop existing triggers; they reference the table we're about to drop. + await sql`drop trigger if exists memories_ai_fts`.execute(db); + await sql`drop trigger if exists memories_au_fts`.execute(db); + await sql`drop trigger if exists memories_bd_fts`.execute(db); + + // 2. Drop the old FTS virtual table (the shadow tables go with it). + // `memories_fts_map` survives — rowids are still valid. + await sql`drop table if exists memories_fts`.execute(db); + + // 3. Recreate with the porter+unicode61 tokenizer chain. Order matters: + // SQLite applies tokenizers right-to-left, so unicode61 splits + + // normalises (handling non-ASCII content) first, and porter then + // stems the resulting tokens. + await sql` + create virtual table memories_fts using fts5( + content, + summary, + tags, + content='', + tokenize='porter unicode61' + ) + `.execute(db); + + // 4. Re-populate from existing memories joined through the stable + // rowid map. Tags are space-joined from the JSON array so each + // tag indexes as a separate token (porter also stems tags, which + // is harmless: tag names are typically short, lowercase, and + // namespace-prefixed; stem collisions are unlikely). + await sql` + insert into memories_fts (rowid, content, summary, tags) + select + map.rowid, + m.content, + coalesce(m.summary, ''), + coalesce( + (select group_concat(value, ' ') from json_each(m.tags_json)), + '' + ) + from memories_fts_map map + join memories m on m.id = map.memory_id + `.execute(db); + + // 5. Recreate triggers — same shape as 0005, no schema change. + // The triggers reference the table by name, so they automatically + // pick up the new tokenizer on every insert/update/delete. + await sql` + create trigger memories_ai_fts after insert on memories begin + insert into memories_fts_map (memory_id) values (new.id); + insert into memories_fts (rowid, content, summary, tags) + values ( + (select rowid from memories_fts_map where memory_id = new.id), + new.content, + coalesce(new.summary, ''), + coalesce( + (select group_concat(value, ' ') from json_each(new.tags_json)), + '' + ) + ); + end + `.execute(db); + + await sql` + create trigger memories_au_fts after update of content, summary, tags_json on memories begin + insert into memories_fts (memories_fts, rowid, content, summary, tags) + values ( + 'delete', + (select rowid from memories_fts_map where memory_id = old.id), + old.content, + coalesce(old.summary, ''), + coalesce( + (select group_concat(value, ' ') from json_each(old.tags_json)), + '' + ) + ); + insert into memories_fts (rowid, content, summary, tags) + values ( + (select rowid from memories_fts_map where memory_id = new.id), + new.content, + coalesce(new.summary, ''), + coalesce( + (select group_concat(value, ' ') from json_each(new.tags_json)), + '' + ) + ); + end + `.execute(db); + + await sql` + create trigger memories_bd_fts before delete on memories begin + insert into memories_fts (memories_fts, rowid, content, summary, tags) + values ( + 'delete', + (select rowid from memories_fts_map where memory_id = old.id), + old.content, + coalesce(old.summary, ''), + coalesce( + (select group_concat(value, ' ') from json_each(old.tags_json)), + '' + ) + ); + delete from memories_fts_map where memory_id = old.id; + end + `.execute(db); + }, +}; diff --git a/packages/core/src/storage/migrations/index.ts b/packages/core/src/storage/migrations/index.ts index 796ced0..2291c9d 100644 --- a/packages/core/src/storage/migrations/index.ts +++ b/packages/core/src/storage/migrations/index.ts @@ -13,6 +13,7 @@ import { migration0004MemorySensitive } from './0004_memory_sensitive.js'; import { migration0005FtsAddTags } from './0005_fts_add_tags.js'; import { migration0006MemoryEventsImportedType } from './0006_memory_events_imported_type.js'; import { migration0007MemoriesStatusLcaIndex } from './0007_memories_status_lca_index.js'; +import { migration0008FtsPorterTokenizer } from './0008_fts_porter_tokenizer.js'; export const MIGRATIONS: readonly Migration[] = [ migration0001InitialSchema, @@ -22,4 +23,5 @@ export const MIGRATIONS: readonly Migration[] = [ migration0005FtsAddTags, migration0006MemoryEventsImportedType, migration0007MemoriesStatusLcaIndex, + migration0008FtsPorterTokenizer, ]; diff --git a/packages/core/test/storage/migrations/0008_fts_porter_tokenizer.test.ts b/packages/core/test/storage/migrations/0008_fts_porter_tokenizer.test.ts new file mode 100644 index 0000000..4572faf --- /dev/null +++ b/packages/core/test/storage/migrations/0008_fts_porter_tokenizer.test.ts @@ -0,0 +1,249 @@ +// Migration 0008: porter+unicode61 tokenization — end-to-end coverage. +// +// Verifies that: +// 1. The rebuilt FTS table matches stem variants (colleague / +// colleagues / colleague's resolve to the same stem, baking / +// bakes / baked resolve to the same stem). +// 2. Pre-migration memories are re-indexed under the new tokenizer. +// 3. Insert trigger uses the new tokenizer for new memories. +// 4. Update trigger picks up stem-equivalent content changes. +// 5. Delete trigger still purges FTS entries cleanly. +// 6. Non-ASCII content still tokenizes (porter chains onto +// unicode61, which handles diacritics and non-Latin scripts). + +import { afterEach, describe, expect, it } from 'vitest'; +import { openDatabase } from '../../../src/storage/database.js'; +import { migrateToLatest } from '../../../src/storage/migrate.js'; +import { MIGRATIONS } from '../../../src/storage/migrations/index.js'; + +interface OpenHandle { + close(): void; +} + +const handles: OpenHandle[] = []; + +afterEach(() => { + while (handles.length > 0) { + handles.pop()?.close(); + } +}); + +function open() { + const handle = openDatabase({ path: ':memory:' }); + handles.push(handle); + return handle; +} + +async function migrate(handle: ReturnType): Promise { + await migrateToLatest(handle.db, MIGRATIONS); +} + +function insertMemory( + handle: ReturnType, + id: string, + content: string, + tags: string[] = [], +): void { + handle.raw + .prepare( + `insert into memories ( + id, created_at, schema_version, scope_type, scope_json, + owner_type, owner_id, kind_type, kind_json, tags_json, + pinned, content, summary, status, stored_confidence, + last_confirmed_at, supersedes, superseded_by, embedding_json + ) values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)`, + ) + .run( + id, + '2026-04-25T00:00:00.000Z', + 1, + 'global', + '{"type":"global"}', + 'local', + 'self', + 'fact', + '{"type":"fact"}', + JSON.stringify(tags), + 0, + content, + null, + 'active', + 0.9, + '2026-04-25T00:00:00.000Z', + null, + null, + null, + ); +} + +function ftsMatch(handle: ReturnType, query: string): string[] { + return ( + handle.raw + .prepare( + `select m.id from memories m + join memories_fts_map fm on fm.memory_id = m.id + join memories_fts ft on ft.rowid = fm.rowid + where memories_fts match ?`, + ) + .all(query) as { id: string }[] + ).map((r) => r.id); +} + +describe('0008_fts_porter_tokenizer', () => { + it('matches plural / possessive variants of the same stem', async () => { + const handle = open(); + await migrate(handle); + insertMemory( + handle, + '01H0000000000000000000000A', + "The user previously made a lemon poppyseed cake for a colleague's going-away party.", + ); + insertMemory(handle, '01H0000000000000000000000B', 'Some unrelated content about hiking.'); + + // Singular and plural both stem to the same root as the stored + // possessive form ("colleague's"). FTS5 MATCH syntax reserves the + // apostrophe so we can't pass it directly in a query, but the + // realistic direction is what matters: a future-question with + // "colleagues" finds a memory containing "colleague's". + expect(ftsMatch(handle, 'colleague')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'colleagues')).toEqual(['01H0000000000000000000000A']); + }); + + it('matches verb-form variants of the same stem', async () => { + const handle = open(); + await migrate(handle); + insertMemory( + handle, + '01H0000000000000000000000A', + 'The user has been baking cookies on the weekends.', + ); + + expect(ftsMatch(handle, 'bake')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'bakes')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'baked')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'baking')).toEqual(['01H0000000000000000000000A']); + }); + + it('re-indexes existing memories under the porter tokenizer', async () => { + // Simulate a pre-0008 install: run migrations up to 0007 with the + // old unicode61 tokenizer, write a memory, then run 0008. + const handle = open(); + const pre0008 = MIGRATIONS.slice(0, 7); + await migrateToLatest(handle.db, pre0008); + + handle.raw + .prepare( + `insert into memories ( + id, created_at, schema_version, scope_type, scope_json, + owner_type, owner_id, kind_type, kind_json, tags_json, + pinned, content, summary, status, stored_confidence, + last_confirmed_at, supersedes, superseded_by, embedding_json, sensitive + ) values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)`, + ) + .run( + '01H0000000000000000000000A', + '2026-04-25T00:00:00.000Z', + 1, + 'global', + '{"type":"global"}', + 'local', + 'self', + 'fact', + '{"type":"fact"}', + '["baking","weekend"]', + 0, + 'The user has been baking cookies on the weekends.', + null, + 'active', + 0.9, + '2026-04-25T00:00:00.000Z', + null, + null, + null, + 0, + ); + + // Pre-0008 the row is indexed under unicode61 (literal "baking" + // and "weekends"). After 0008 the stem-equivalent queries should hit. + await migrateToLatest(handle.db, MIGRATIONS); + + expect(ftsMatch(handle, 'bake')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'weekend')).toEqual(['01H0000000000000000000000A']); + // Tags stem too; the stored tag is "baking", a query for "bakes" hits. + expect(ftsMatch(handle, 'tags:bakes')).toEqual(['01H0000000000000000000000A']); + }); + + it('insert trigger uses the new tokenizer for memories created post-migration', async () => { + const handle = open(); + await migrate(handle); + + // Insert AFTER migration runs — the insert trigger fires and the + // FTS row is created under porter. + insertMemory( + handle, + '01H0000000000000000000000A', + 'The user researched adoption agencies last spring.', + ); + + expect(ftsMatch(handle, 'research')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'researches')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'researched')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'agencies')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'agency')).toEqual(['01H0000000000000000000000A']); + }); + + it('update trigger re-indexes stem-equivalent content changes', async () => { + const handle = open(); + await migrate(handle); + insertMemory(handle, '01H0000000000000000000000A', 'The user was running every morning.'); + + expect(ftsMatch(handle, 'run')).toEqual(['01H0000000000000000000000A']); + + handle.raw + .prepare('update memories set content = ? where id = ?') + .run('The user switched to cycling every morning.', '01H0000000000000000000000A'); + + expect(ftsMatch(handle, 'run')).toEqual([]); + expect(ftsMatch(handle, 'cycling')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'cycles')).toEqual(['01H0000000000000000000000A']); + expect(ftsMatch(handle, 'cycled')).toEqual(['01H0000000000000000000000A']); + }); + + it('delete trigger removes FTS entries and the map row', async () => { + const handle = open(); + await migrate(handle); + insertMemory(handle, '01H0000000000000000000000A', 'A memory about gardening tomatoes.'); + + expect(ftsMatch(handle, 'garden')).toEqual(['01H0000000000000000000000A']); + + handle.raw.prepare('delete from memories where id = ?').run('01H0000000000000000000000A'); + + expect(ftsMatch(handle, 'garden')).toEqual([]); + const map = handle.raw + .prepare('select count(*) as c from memories_fts_map where memory_id = ?') + .get('01H0000000000000000000000A') as { c: number }; + expect(map.c).toBe(0); + }); + + it('preserves non-ASCII content (porter chains onto unicode61)', async () => { + const handle = open(); + await migrate(handle); + // Diacritics, German umlaut, Japanese — all should still be + // findable. unicode61's diacritic-folding fires before porter, + // so "café" indexes as "cafe" and remains queryable. + insertMemory( + handle, + '01H0000000000000000000000A', + 'The user visited a small café in München and ordered ラーメン.', + ); + + // Diacritic-folded ASCII match. + expect(ftsMatch(handle, 'cafe')).toEqual(['01H0000000000000000000000A']); + // Original diacritic-bearing form also hits (folding is symmetric). + expect(ftsMatch(handle, 'café')).toEqual(['01H0000000000000000000000A']); + // German umlaut form similar. + expect(ftsMatch(handle, 'munchen')).toEqual(['01H0000000000000000000000A']); + // Non-Latin script is unaffected by stemming and still indexes. + expect(ftsMatch(handle, 'ラーメン')).toEqual(['01H0000000000000000000000A']); + }); +}); diff --git a/packages/embedder-local/src/embedder.ts b/packages/embedder-local/src/embedder.ts index 872d2c5..aca8dfa 100644 --- a/packages/embedder-local/src/embedder.ts +++ b/packages/embedder-local/src/embedder.ts @@ -40,10 +40,30 @@ export const DEFAULT_LOCAL_DIMENSION = CONFIG_KEYS['embedder.local.dimension'].d */ export type EmbedFn = (text: string) => Promise; +/** + * Optional batched variant. The transformers.js v3 pipeline + * accepts an array input and returns a `[batch, dim]` tensor — + * one forward pass for the whole batch. When a loader exposes + * this, the embedder routes its `embedBatch` calls here instead + * of looping `embed`. The contract preserves order: result row + * `i` is the embedding of `texts[i]`. + */ +export type EmbedBatchFn = (texts: readonly string[]) => Promise; + +/** + * What a loader returns. `embed` is required; `embedBatch` is + * optional — loaders that don't expose it cause the embedder to + * fall back to looping `embed`, preserving the previous behaviour. + */ +export interface EmbedRuntime { + readonly embed: EmbedFn; + readonly embedBatch?: EmbedBatchFn; +} + /** * Pluggable initialiser. Receives the resolved model id and the - * resolved cache directory (when set), and returns a function - * that performs a single embedding. + * resolved cache directory (when set), and returns the runtime + * surface (single + optional batch). * * The default implementation (`createDefaultLoader`) wraps * `@huggingface/transformers`. Tests pass a fake to keep the @@ -52,7 +72,7 @@ export type EmbedFn = (text: string) => Promise; export type LocalEmbedderLoader = ( model: string, options: LocalEmbedderLoaderContext, -) => Promise; +) => Promise; export interface LocalEmbedderLoaderContext { readonly cacheDir?: string; @@ -152,11 +172,11 @@ export function createLocalEmbedder(options: LocalEmbedderOptions = {}): Embeddi // Single-flight init: every concurrent `embed` call awaits the // same promise, so the model is loaded exactly once even under - // a burst. We cache the *promise*, not the resolved fn, so a - // failed first init can be retried by replacing the cache. - let pending: Promise | undefined; + // a burst. We cache the *promise*, not the resolved runtime, so + // a failed first init can be retried by replacing the cache. + let pending: Promise | undefined; - const ensureReady = (): Promise => { + const ensureReady = (): Promise => { if (pending === undefined) { const attempt = loader(model, loaderContext); // If the loader rejects, clear the cache so the next call @@ -197,28 +217,47 @@ export function createLocalEmbedder(options: LocalEmbedderOptions = {}): Embeddi model, dimension, async embed(text: string): Promise { - const embedFn = await ensureReady(); - return validateVector(await runEmbed(embedFn, text, 'single'), 'single'); + const runtime = await ensureReady(); + return validateVector(await runEmbed(runtime.embed, text, 'single'), 'single'); }, async embedBatch(texts: readonly string[]): Promise { - const embedFn = await ensureReady(); - // Sequential under the hood for now — the transformers.js - // pipeline does not yet expose a true batch API for - // feature-extraction. The win is having the interface so - // callers batch upfront rather than interleaving embed + - // dedup per candidate. When transformers.js adds batching, - // this is the one place to change. + if (texts.length === 0) return []; + const runtime = await ensureReady(); + + // Fast path: the loader exposes a real batched implementation. + // transformers.js v3's feature-extraction pipeline accepts an + // array input and returns one [batch, dim] tensor in a single + // forward pass; the default loader uses that path. The + // wallclock cap applies to the whole batch as a single unit + // (callers shape the batch size — we don't reshape). + if (runtime.embedBatch !== undefined) { + const prepared = texts.map((t) => prepareText(t)); + const work = runtime.embedBatch(prepared); + const label = `batch[${texts.length}]`; + const vectors = + timeoutMs !== undefined ? await withTimeout(work, timeoutMs, label) : await work; + if (vectors.length !== texts.length) { + throw new Error( + `Local embedder batch returned ${vectors.length} vectors for ${texts.length} inputs (model='${model}'). The runtime did not preserve batch length.`, + ); + } + return vectors.map((v, i) => validateVector(v, `batch[${i}]`)); + } + + // Slow path: fall back to sequential `embed` calls. Preserves + // behaviour for loaders (notably the test fixtures) that only + // implement single-text embedding. const results: (readonly number[])[] = []; for (const text of texts) { const label = `batch[${results.length}]`; - results.push(validateVector(await runEmbed(embedFn, text, label), label)); + results.push(validateVector(await runEmbed(runtime.embed, text, label), label)); } return results; }, async warmup(): Promise { // Drive the single-flight init so the first user-facing // `embed()` is not stuck behind a model download / pipeline - // construction. We discard the resulting `EmbedFn` reference + // construction. We discard the resulting runtime reference // intentionally — `ensureReady` caches it. Timeouts and byte // caps deliberately do NOT apply here: warmup is fire-and- // forget at boot time and a partial model download must be @@ -266,7 +305,7 @@ export function createDefaultLoader(): LocalEmbedderLoader { // user wants smaller-and-faster at the cost of recall. const extractor = await runtime.pipeline('feature-extraction', repo, { dtype: 'fp32' }); - return async (text: string): Promise => { + const embed: EmbedFn = async (text) => { const output = await extractor(text, { pooling: 'mean', normalize: true, @@ -277,5 +316,38 @@ export function createDefaultLoader(): LocalEmbedderLoader { // contract pure-JS. return Array.from(output.data); }; + + // transformers.js v3's feature-extraction pipeline accepts an + // array input and returns a single `[batch, dim]` tensor — one + // forward pass for the whole batch instead of N. Confirmed + // numerically equivalent to looping the single-call form (row + // `i` matches a single call on `texts[i]`). The wallclock win + // grows with batch size: per-call ~30–50 ms on CPU vs amortised + // tokenisation + one inference pass for the batch. + const embedBatch: EmbedBatchFn = async (texts) => { + if (texts.length === 0) return []; + // Cast through `string[]` because the runtime's pipeline + // signature is overloaded for single + array but its TS type + // hides the array overload behind a generic. + const output = await extractor(texts as unknown as string, { + pooling: 'mean', + normalize: true, + }); + const dims = output.dims as readonly number[] | undefined; + const batch = dims?.[0] ?? texts.length; + const dim = dims?.[1] ?? Math.floor(output.data.length / texts.length); + if (batch !== texts.length) { + throw new Error( + `transformers.js batch output rows (${batch}) did not match input length (${texts.length}); refusing to slice and risk misalignment.`, + ); + } + const rows: number[][] = []; + for (let i = 0; i < batch; i += 1) { + rows.push(Array.from(output.data.slice(i * dim, (i + 1) * dim))); + } + return rows; + }; + + return { embed, embedBatch }; }; } diff --git a/packages/embedder-local/src/index.ts b/packages/embedder-local/src/index.ts index a83ef58..7181368 100644 --- a/packages/embedder-local/src/index.ts +++ b/packages/embedder-local/src/index.ts @@ -15,7 +15,9 @@ export { } from './embedder.js'; export type { + EmbedBatchFn, EmbedFn, + EmbedRuntime, LocalEmbedderLoader, LocalEmbedderLoaderContext, LocalEmbedderOptions, diff --git a/packages/embedder-local/test/embedder.test.ts b/packages/embedder-local/test/embedder.test.ts index 81505ef..3298ae4 100644 --- a/packages/embedder-local/test/embedder.test.ts +++ b/packages/embedder-local/test/embedder.test.ts @@ -3,6 +3,7 @@ import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; import { DEFAULT_LOCAL_DIMENSION, DEFAULT_LOCAL_MODEL, + type EmbedBatchFn, type EmbedFn, type LocalEmbedderLoader, createLocalEmbedder, @@ -29,7 +30,8 @@ describe('createLocalEmbedder', () => { embedFn = vi.fn(async (_text: string) => buildVector(DEFAULT_LOCAL_DIMENSION, 0.1), ) as ReturnType & EmbedFn; - loader = vi.fn(async () => embedFn) as ReturnType & LocalEmbedderLoader; + loader = vi.fn(async () => ({ embed: embedFn })) as ReturnType & + LocalEmbedderLoader; }); afterEach(() => { @@ -71,9 +73,9 @@ describe('createLocalEmbedder', () => { it('forwards the configured model and cacheDir to the loader', async () => { const customEmbed: EmbedFn = async () => buildVector(384, 0.1); - const customLoader: ReturnType & LocalEmbedderLoader = vi.fn( - async () => customEmbed, - ); + const customLoader: ReturnType & LocalEmbedderLoader = vi.fn(async () => ({ + embed: customEmbed, + })); const provider = createLocalEmbedder({ loader: customLoader, model: 'all-MiniLM-L6-v2', @@ -95,14 +97,14 @@ describe('createLocalEmbedder', () => { it('throws when the produced vector length does not match the dimension', async () => { const wrongLength: EmbedFn = async () => buildVector(10, 0); - const wrongLoader: LocalEmbedderLoader = async () => wrongLength; + const wrongLoader: LocalEmbedderLoader = async () => ({ embed: wrongLength }); const provider = createLocalEmbedder({ loader: wrongLoader }); await expect(provider.embed('hi')).rejects.toThrow(/length 10, expected 768/); }); it('honours an overridden dimension when validating output', async () => { const tinyEmbed: EmbedFn = async () => buildVector(8, 0.5); - const tinyLoader: LocalEmbedderLoader = async () => tinyEmbed; + const tinyLoader: LocalEmbedderLoader = async () => ({ embed: tinyEmbed }); const provider = createLocalEmbedder({ loader: tinyLoader, dimension: 8 }); const v = await provider.embed('x'); expect(v).toHaveLength(8); @@ -137,8 +139,8 @@ describe('createLocalEmbedder', () => { it('does not consume the configured timeout', async () => { let resolveSlow: ((fn: EmbedFn) => void) | undefined; const slowLoader: LocalEmbedderLoader = () => - new Promise((resolve) => { - resolveSlow = resolve; + new Promise((resolve) => { + resolveSlow = (fn: EmbedFn) => resolve({ embed: fn }); }); const provider = createLocalEmbedder({ loader: slowLoader, timeoutMs: 50 }); const warmupPromise = provider.warmup?.(); @@ -157,7 +159,7 @@ describe('createLocalEmbedder', () => { if (calls === 1) { throw new Error('cold start failed'); } - return embedFn; + return { embed: embedFn }; }; const provider = createLocalEmbedder({ loader: flakyLoader }); await expect(provider.embed('first')).rejects.toThrow('cold start failed'); @@ -176,7 +178,7 @@ describe('createLocalEmbedder', () => { received = text; return buildVector(DEFAULT_LOCAL_DIMENSION, 0); }; - const captureLoader: LocalEmbedderLoader = async () => captureEmbed; + const captureLoader: LocalEmbedderLoader = async () => ({ embed: captureEmbed }); const provider = createLocalEmbedder({ loader: captureLoader, maxInputBytes: 16 }); await provider.embed('a'.repeat(64)); expect(received).toBe('a'.repeat(16)); @@ -188,7 +190,7 @@ describe('createLocalEmbedder', () => { received = text; return buildVector(DEFAULT_LOCAL_DIMENSION, 0); }; - const captureLoader: LocalEmbedderLoader = async () => captureEmbed; + const captureLoader: LocalEmbedderLoader = async () => ({ embed: captureEmbed }); const provider = createLocalEmbedder({ loader: captureLoader, maxInputBytes: 32 }); await provider.embed('hello'); expect(received).toBe('hello'); @@ -200,7 +202,7 @@ describe('createLocalEmbedder', () => { received = text; return buildVector(DEFAULT_LOCAL_DIMENSION, 0); }; - const captureLoader: LocalEmbedderLoader = async () => captureEmbed; + const captureLoader: LocalEmbedderLoader = async () => ({ embed: captureEmbed }); // 'é' is 2 UTF-8 bytes. Cap of 3 bytes leaves space for one // 'a' + 'é' = 3 bytes, with no partial codepoint. const provider = createLocalEmbedder({ loader: captureLoader, maxInputBytes: 3 }); @@ -217,7 +219,7 @@ describe('createLocalEmbedder', () => { new Promise((resolve) => { setTimeout(() => resolve(buildVector(DEFAULT_LOCAL_DIMENSION, 0)), 200); }); - const slowLoader: LocalEmbedderLoader = async () => slowEmbed; + const slowLoader: LocalEmbedderLoader = async () => ({ embed: slowEmbed }); const provider = createLocalEmbedder({ loader: slowLoader, timeoutMs: 50 }); await expect(provider.embed('x')).rejects.toThrow(/timed out after 50ms/u); }); @@ -227,5 +229,116 @@ describe('createLocalEmbedder', () => { const v = await provider.embed('x'); expect(v).toHaveLength(DEFAULT_LOCAL_DIMENSION); }); + + it('applies the timeout to the whole batch as one unit', async () => { + // The batch path takes a single wallclock cap — it's one + // runtime call from the embedder's perspective, not N. + const slowBatch: EmbedBatchFn = (texts) => + new Promise((resolve) => { + setTimeout(() => resolve(texts.map(() => buildVector(DEFAULT_LOCAL_DIMENSION, 0))), 200); + }); + const slowLoader: LocalEmbedderLoader = async () => ({ + embed: embedFn, + embedBatch: slowBatch, + }); + const provider = createLocalEmbedder({ loader: slowLoader, timeoutMs: 50 }); + await expect(provider.embedBatch!(['a', 'b', 'c'])).rejects.toThrow( + /batch\[3\] timed out after 50ms/u, + ); + }); + }); + + describe('embedBatch', () => { + it('uses the loader-provided batch fn when available (fast path)', async () => { + // Each row encodes its input index so the test can assert + // order preservation through the slice. + const batchFn = vi.fn(async (texts: readonly string[]) => + texts.map((_, i) => buildVector(DEFAULT_LOCAL_DIMENSION, i * 0.01)), + ); + const batchLoader: LocalEmbedderLoader = async () => ({ + embed: embedFn, + embedBatch: batchFn, + }); + const provider = createLocalEmbedder({ loader: batchLoader }); + const out = await provider.embedBatch!(['a', 'b', 'c']); + + expect(batchFn).toHaveBeenCalledTimes(1); + expect(batchFn).toHaveBeenCalledWith(['a', 'b', 'c']); + // Single-call path must NOT have been used when the batch fn + // is available — that's the whole point of the fast path. + expect(embedFn).not.toHaveBeenCalled(); + expect(out).toHaveLength(3); + expect(out[0]![0]).toBeCloseTo(0); + expect(out[1]![0]).toBeCloseTo(0.01); + expect(out[2]![0]).toBeCloseTo(0.02); + }); + + it('falls back to sequential embed when the loader omits embedBatch', async () => { + // The default `loader` fixture returns only { embed }. + const provider = createLocalEmbedder({ loader }); + const out = await provider.embedBatch!(['a', 'b', 'c']); + expect(embedFn).toHaveBeenCalledTimes(3); + expect(out).toHaveLength(3); + }); + + it('returns [] for empty input without invoking the loader', async () => { + const batchFn = vi.fn(async () => [] as readonly (readonly number[])[]); + const batchLoader = vi.fn(async () => ({ + embed: embedFn, + embedBatch: batchFn, + })) as ReturnType & LocalEmbedderLoader; + const provider = createLocalEmbedder({ loader: batchLoader }); + const out = await provider.embedBatch!([]); + expect(out).toEqual([]); + // Empty input short-circuits before the runtime is touched. + expect(batchLoader).not.toHaveBeenCalled(); + expect(batchFn).not.toHaveBeenCalled(); + }); + + it('rejects when the batch runtime returns the wrong row count', async () => { + // A misbehaving runtime that drops a row would corrupt the + // caller's input↔output alignment. Better to fail loud. + const dropsRow: EmbedBatchFn = async (texts) => + texts.slice(0, texts.length - 1).map(() => buildVector(DEFAULT_LOCAL_DIMENSION, 0)); + const badLoader: LocalEmbedderLoader = async () => ({ + embed: embedFn, + embedBatch: dropsRow, + }); + const provider = createLocalEmbedder({ loader: badLoader }); + await expect(provider.embedBatch!(['a', 'b', 'c'])).rejects.toThrow( + /returned 2 vectors for 3 inputs/u, + ); + }); + + it('validates each row in the batch against the configured dimension', async () => { + const wrongDim: EmbedBatchFn = async (texts) => texts.map(() => buildVector(10, 0)); + const wrongLoader: LocalEmbedderLoader = async () => ({ + embed: embedFn, + embedBatch: wrongDim, + }); + const provider = createLocalEmbedder({ loader: wrongLoader }); + // `EmbeddingProvider.embedBatch` is optional in the core + // contract but always provided by `createLocalEmbedder`. + await expect(provider.embedBatch!(['a'])).rejects.toThrow(/length 10, expected 768/u); + }); + + it('truncates oversize inputs on the batch path before the runtime sees them', async () => { + // The maxInputBytes guard applies to every row, batch or not. + let captured: readonly string[] | undefined; + const captureBatch: EmbedBatchFn = async (texts) => { + captured = texts; + return texts.map(() => buildVector(DEFAULT_LOCAL_DIMENSION, 0)); + }; + const captureLoader: LocalEmbedderLoader = async () => ({ + embed: embedFn, + embedBatch: captureBatch, + }); + const provider = createLocalEmbedder({ + loader: captureLoader, + maxInputBytes: 4, + }); + await provider.embedBatch!(['short', 'this is too long', 'fits']); + expect(captured).toEqual(['shor', 'this', 'fits']); + }); }); }); diff --git a/packages/landing/src/App.tsx b/packages/landing/src/App.tsx index bc46c77..94123f1 100644 --- a/packages/landing/src/App.tsx +++ b/packages/landing/src/App.tsx @@ -374,10 +374,45 @@ memory; treat chat as ephemeral. skipped:[], superseded:[], mode:"async", batchId, hint, status: "accepted"}\` — that is the receipt, not a failure. Writes land as memories within seconds; do not retry. +- \`extract_memory\`'s candidate shape is **flat** — \`kind\` is a + string (\`"kind":"fact"\`) and \`rationale\` / \`language\` are + top-level fields. This differs from \`write_memory\`, which uses + a discriminated-union \`kind\` object (\`"kind":{"type":"fact"}\`) + with those fields nested inside. Copying the write_memory shape + into an extract candidate produces \`INVALID_INPUT\` and rejects + the whole batch. - For preferences and decisions, start \`content\` with a single \`topic: value\` line followed by prose. Conflict detection parses that line; without it, contradictory preferences - silently coexist. + silently coexist. The same rule applies to both \`write_memory\` + and \`extract_memory\`. +- Distillation is **retrieval indexing**, not summarisation. The + future question may ask about any specific date, named entity, + proper noun, action, or object that came up — index every + concrete reference, don't capture the gist. +- Preserve specific terms when distilling. Use the speakers' exact + words for proper nouns, identity qualifiers, places, dates, and + the object of any action. "Raghu researched adoption agencies", + not "researched career options". "Transgender woman", not + "woman". Don't squash enumerations; emit each item or list them + explicitly. When in doubt, include — the server dedups. +- Capture facts about every named participant, not only the user. + If the user mentions someone ("my friend Alex is moving to + Berlin for a SAP job"), emit memories attributed to that named + person (Alex is moving to Berlin; Alex has a new job at SAP), + not collapsed onto the user. The future question may ask about + anyone named in the conversation. +- Emit a candidate for every dated event. If the user mentions an + event with an absolute date ("May 7") or a relative one + ("yesterday", "last Tuesday"), resolve to an absolute date and + emit it in the content. Don't fold dated events into untimed + habits — the future "when did X happen?" query can only be + answered by a memory that names the date. +- Capture precursor actions alongside outcomes. When the user + describes a sequence ("researched X then chose Y", "tried A and + settled on B"), emit both — a candidate for the precursor and a + candidate for the outcome. Future questions can target either + step; the outcome never erases the precursor. - Use the user's preferred name from \`info_system.user.preferredName\` when authoring memory content; fall back to "The user" when that field is null. diff --git a/packages/schema/src/config-keys.ts b/packages/schema/src/config-keys.ts index 37371fe..42bb437 100644 --- a/packages/schema/src/config-keys.ts +++ b/packages/schema/src/config-keys.ts @@ -281,10 +281,10 @@ export const CONFIG_KEYS = { // code changes (principle 1: configurable defaults). 'retrieval.fts.tokenizer': defineKey({ schema: z.enum(['unicode61', 'porter']), - default: 'unicode61', + default: 'porter', mutable: false, description: - 'FTS5 tokenizer for `memories_fts`. Pinned at server start because changing it requires a reindex.', + 'FTS5 tokenizer for `memories_fts`. `porter` stems tokens (so "colleagues" and "colleague\'s" share a stem and match each other in search) and is chained onto `unicode61` so non-ASCII content still tokenises correctly. `unicode61` alone is exact-token matching with no stemming — choose this only when proper-noun precision matters more than prose recall. Migration 0008 sets the FTS index up with porter; switching to `unicode61` requires a manual `drop table memories_fts` and reindex, which is why the key is immutable.', }), // BM25 `k1` / `b` are intentionally NOT registered. SQLite // FTS5 ships with the default values baked in and exposes no diff --git a/packages/schema/src/primitives.ts b/packages/schema/src/primitives.ts index d28ca57..a7cad73 100644 --- a/packages/schema/src/primitives.ts +++ b/packages/schema/src/primitives.ts @@ -94,7 +94,12 @@ export const TagSchema = z 'A tag string (1–64 chars). Trimmed and lowercased on ingest. Allowed characters: a-z, 0-9, "-", "_", "/", ".", ":". Examples: "project:memento", "lang-typescript", "config".', ) .transform((value) => value.trim().toLowerCase()) - .pipe(z.string().min(1).max(64).regex(TAG_PATTERN)) + .pipe( + z.string().min(1).max(64).regex(TAG_PATTERN, { + message: + 'Tag must start with a-z or 0-9 and contain only lowercase letters, digits, and `-`, `_`, `/`, `.`, `:`. Spaces, commas, and uppercase letters are not allowed (the value is auto-lowercased on ingest, so writing "Project:Memento" yields "project:memento" — but "April 15, 2026" is rejected because spaces and commas aren\'t in the charset). Pre-process human-prose values: replace runs of disallowed chars with `-` and lowercase.', + }), + ) .brand<'Tag'>(); export type Tag = z.infer; diff --git a/scripts/bench.mjs b/scripts/bench.mjs new file mode 100644 index 0000000..1f720ee --- /dev/null +++ b/scripts/bench.mjs @@ -0,0 +1,553 @@ +#!/usr/bin/env node +// scripts/bench.mjs — Memento ↔ memorybench harness driver. +// +// Drives the public memorybench harness (supermemoryai/memorybench) end +// to end against Memento. Builds Memento, stages the memorybench fork +// containing the Memento provider, sets `MEMENTO_BIN` to the locally +// built CLI, spawns `bun run src/index.ts run -p memento -b ` for +// each requested benchmark, and renders a markdown summary. +// +// Distinct from the other two harness scripts: +// +// - scripts/retrieval-eval.mjs — measures Memento's internal ranker +// on a small labeled needle set in a fresh in-memory SQLite. +// "Is the ranker returning the right memories?" +// - scripts/stress-test.mjs — measures engine throughput / latency +// at scale. "Is the engine fast, correct, stable at scale?" +// - scripts/bench.mjs (this file) — measures Memento on public +// industry datasets (LoCoMo, LongMemEval, ConvoMem) judged by an +// LLM. "How does Memento answer real long-conversation questions?" +// +// Usage: +// node scripts/bench.mjs # LoCoMo + LongMemEval, defaults +// node scripts/bench.mjs --benchmark=locomo --limit=5 # first-5 consecutive questions on LoCoMo +// node scripts/bench.mjs --sample=3 --sample-type=random # 3 per category, randomly chosen across conversations +// node scripts/bench.mjs --judge=gemini-2.5-pro # cross-family judge (vs default sonnet-4.6) +// node scripts/bench.mjs --search-limit=30 # provider top-K (env: MEMENTO_BENCH_SEARCH_LIMIT) +// node scripts/bench.mjs --concurrency-ingest=1 # serialize ingest (safe against Anthropic rate-limit overloads) +// node scripts/bench.mjs --memorybench-dir=/path/to/fork # use a local fork checkout (skip clone) +// node scripts/bench.mjs --memorybench-ref= # pin a specific fork ref +// node scripts/bench.mjs --out=./bench # output directory (default: ./bench) +// node scripts/bench.mjs --resume=memento-locomo-2026-05-14... # resume a crashed run by its runId +// +// Output: a single markdown file at `/.md` per invocation (and one +// per benchmark inside that file). The `--memorybench-dir`'s `data/runs//` +// holds the per-question JSON reports. Crashed runs print the resume command on +// the way out — run it from the same machine and the orchestrator picks up at +// the failed phase of the failed question, skipping all completed work. +// +// Architectural notes: +// - NOT part of `pnpm verify`. Needs network, judge API keys, and +// hours. AGENTS.md is explicit that verify must pass offline. +// - Every behavioral value is declared in `DEFAULTS` at the top of +// the file (architectural rule 2 — no hardcoded behavioral +// constants). Flags / env vars are documented in +// `docs/guides/benchmark.md`. +// - Reuses the gitInfo()-style reproducibility footer pattern from +// scripts/retrieval-eval.mjs so a future reader can diff runs. +// - Spawns child processes; doesn't import Memento packages directly. +// The provider lives in the memorybench fork. + +import { execSync, spawn } from 'node:child_process'; +import { existsSync, mkdirSync, mkdtempSync, readFileSync, rmSync, statSync } from 'node:fs'; +import { writeFile } from 'node:fs/promises'; +import { createRequire } from 'node:module'; +import { tmpdir } from 'node:os'; +import { join, resolve } from 'node:path'; + +// ----- Defaults / config ----- + +const DEFAULTS = { + // Upstream memorybench. Pre-merge of the Memento provider, point + // `--memorybench-dir` at a local checkout of the PR branch. Once + // the provider is merged into supermemoryai/memorybench:main, the + // default clone-and-run path just works. + memorybenchRepo: + process.env.MEMORYBENCH_REPO ?? 'https://github.com/supermemoryai/memorybench.git', + memorybenchRef: process.env.MEMORYBENCH_REF ?? 'main', + // Judge, answering, and distillation model all default to + // `sonnet-4.6`. Rationale: Memento is LLM-agnostic, but the bulk of + // MCP-using clients today (Claude Code, Cursor, Claude Desktop) put + // Claude Sonnet on the conversation side — which is *also* the model + // doing distillation in real Memento usage (extract_memory is called + // from the same assistant that's having the chat). Defaulting to + // sonnet-4.6 produces numbers that reflect what a real Memento user + // actually gets, not what a Flash-tier sidecar produces. Sonnet 4.6 + // supports temperature=0 (deterministic at the model layer) and is + // in the fork's MODEL_CONFIGS. Override via `--judge` / + // `--answering-model` / MEMENTO_DISTILL_MODEL for other families; + // an independent-family judge (e.g. `gpt-4o`) is the standard + // robustness check. + judgeModel: 'sonnet-4.6', + answeringModel: 'sonnet-4.6', + // First baseline: LoCoMo + LongMemEval. ConvoMem deferred. + benchmarks: ['locomo', 'longmemeval'], + outDir: null, // resolved at runtime to bench-output- +}; + +function ts() { + return new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19); +} + +function parseArgs() { + const args = new Map(); + for (const a of process.argv.slice(2)) { + if (!a.startsWith('--')) continue; + const eq = a.indexOf('='); + if (eq > 0) args.set(a.slice(2, eq), a.slice(eq + 1)); + else args.set(a.slice(2), 'true'); + } + const benchmarkArg = args.get('benchmark'); + const benchmarks = benchmarkArg + ? benchmarkArg.split(',').map((s) => s.trim()) + : DEFAULTS.benchmarks; + const limitRaw = args.get('limit'); + const limit = limitRaw !== undefined ? Number(limitRaw) : undefined; + if (limit !== undefined && (!Number.isFinite(limit) || limit <= 0)) { + throw new Error(`--limit must be a positive integer (got: ${limitRaw})`); + } + const resumeRaw = args.get('resume'); + const resumeRunIds = + resumeRaw && resumeRaw !== 'true' + ? resumeRaw + .split(',') + .map((s) => s.trim()) + .filter(Boolean) + : []; + // When resuming, the runId encodes the benchmark and the original + // timestamp; use that as the ts so the rendered summary lands next + // to the original run's artifacts. + let tsStr = ts(); + if (resumeRunIds.length > 0) { + const parsed = parseRunId(resumeRunIds[0]); + if (parsed?.ts) tsStr = parsed.ts; + } + return { + ts: tsStr, + benchmarks, + limit, + judgeModel: args.get('judge') ?? DEFAULTS.judgeModel, + answeringModel: args.get('answering-model') ?? DEFAULTS.answeringModel, + memorybenchDir: args.get('memorybench-dir') ?? process.env.MEMORYBENCH_DIR ?? null, + memorybenchRepo: args.get('memorybench-repo') ?? DEFAULTS.memorybenchRepo, + memorybenchRef: args.get('memorybench-ref') ?? DEFAULTS.memorybenchRef, + // null = resolve against mementoRoot in main() (so a cwd inside the + // fork doesn't leak the output directory into the fork worktree). + outDir: args.get('out') ?? null, + concurrency: parseConcurrencyFlag(args.get('concurrency')), + concurrencyIngest: parseConcurrencyFlag(args.get('concurrency-ingest')), + concurrencyIndexing: parseConcurrencyFlag(args.get('concurrency-indexing')), + concurrencySearch: parseConcurrencyFlag(args.get('concurrency-search')), + concurrencyAnswer: parseConcurrencyFlag(args.get('concurrency-answer')), + concurrencyEvaluate: parseConcurrencyFlag(args.get('concurrency-evaluate')), + searchLimit: args.get('search-limit') ? Number(args.get('search-limit')) : undefined, + sample: args.get('sample') ? Number(args.get('sample')) : undefined, + sampleType: args.get('sample-type'), + resumeRunIds, + }; +} + +// runId format: `memento--` where is the ISO-derived +// timestamp from ts() above (e.g. `memento-locomo-2026-05-14T15-16-29`). +function parseRunId(runId) { + const m = /^memento-([a-z0-9-]+?)-(\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})$/.exec(runId); + if (!m) return null; + return { benchmark: m[1], ts: m[2] }; +} + +function parseConcurrencyFlag(raw) { + if (raw === undefined) return undefined; + const n = Number(raw); + if (!Number.isInteger(n) || n <= 0) { + throw new Error(`concurrency flags must be positive integers (got: ${raw})`); + } + return n; +} + +// ----- Helpers ----- + +function gitInfo(cwd) { + const opts = { encoding: 'utf8', stdio: ['pipe', 'pipe', 'ignore'], cwd }; + try { + const branch = execSync('git rev-parse --abbrev-ref HEAD', opts).trim(); + const sha = execSync('git rev-parse HEAD', opts).trim(); + const shaShort = execSync('git rev-parse --short HEAD', opts).trim(); + const dirty = execSync('git status --porcelain', opts).trim() !== ''; + return { branch, sha, shaShort, dirty }; + } catch { + return { branch: 'unknown', sha: 'unknown', shaShort: 'unknown', dirty: false }; + } +} + +function requireEnv(name, reason) { + const v = process.env[name]; + if (!v) { + console.error(`[bench] missing env var ${name}: ${reason}`); + console.error('[bench] see docs/guides/benchmark.md for the full setup.'); + process.exit(1); + } + return v; +} + +function judgeFamily(model) { + if (/^(sonnet|opus|haiku)-/.test(model)) return 'anthropic'; + if (/^gpt-/.test(model)) return 'openai'; + if (/^gemini-/.test(model)) return 'google'; + return 'unknown'; +} + +function checkApiKey(model, role) { + const family = judgeFamily(model); + if (family === 'anthropic') + requireEnv('ANTHROPIC_API_KEY', `${role} model ${model} is in the Anthropic family`); + else if (family === 'openai') + requireEnv('OPENAI_API_KEY', `${role} model ${model} is in the OpenAI family`); + else if (family === 'google') + requireEnv('GOOGLE_API_KEY', `${role} model ${model} is in the Google family`); + else + console.warn( + `[bench] warning: unknown ${role} model family for ${model}; not checking API key`, + ); +} + +// better-sqlite3 is a native module; loading it against a Node whose +// NODE_MODULE_VERSION differs from the one that built the workspace +// crashes the spawned server with an opaque "MCP error -32000: +// Connection closed". The common trigger is homebrew Node on PATH +// shadowing nvm's Node when running the script from a non-interactive +// shell. We check the same require resolution the server will use, +// so a mismatch fails here with a clear remedy. +function assertNativeAbiMatches(mementoRoot) { + try { + const requireFromCore = createRequire(join(mementoRoot, 'packages/core/package.json')); + requireFromCore('better-sqlite3'); + } catch (e) { + const firstLine = String(e?.message ?? e).split('\n')[0]; + console.error( + `[bench] better-sqlite3 failed to load under this Node (${process.version}, modules=${process.versions.modules}, execPath=${process.execPath}):`, + ); + console.error(`[bench] ${firstLine}`); + console.error('[bench] likely cause: the bench is running under a different Node than the one'); + console.error('[bench] that installed the workspace (e.g. homebrew Node on PATH vs nvm).'); + console.error('[bench] fix one of:'); + console.error('[bench] - invoke with the workspace Node directly, e.g.'); + console.error('[bench] /Users//.nvm/versions/node/v22.x/bin/node scripts/bench.mjs'); + console.error('[bench] - source nvm and re-select, e.g.'); + console.error('[bench] . "$NVM_DIR/nvm.sh" && nvm use && node scripts/bench.mjs'); + console.error('[bench] - rebuild against the current Node: pnpm rebuild better-sqlite3'); + process.exit(1); + } +} + +function spawnAwait(cmd, argv, opts) { + return new Promise((res, rej) => { + const child = spawn(cmd, argv, opts); + child.on('error', rej); + child.on('close', (code) => { + if (code === 0) res(); + else rej(new Error(`${cmd} ${argv.join(' ')} exited ${code}`)); + }); + }); +} + +// Memorybench's orchestrator writes its checkpoint after every phase +// boundary. After a crash, re-invoking with `--resume=` picks +// up at the failed phase of the failed question, skipping all +// completed work. Print the exact command so the user (or operator +// scanning the log) doesn't have to reconstruct it. +function printResumeHint(runId, opts) { + const cmd = [process.execPath, 'scripts/bench.mjs', `--resume=${runId}`]; + if (opts.memorybenchDir) cmd.push(`--memorybench-dir=${opts.memorybenchDir}`); + if (opts.concurrencyIngest) cmd.push(`--concurrency-ingest=${opts.concurrencyIngest}`); + console.error(''); + console.error('[bench] to resume, run from the Memento repo root:'); + console.error(`[bench] ${cmd.join(' ')}`); + console.error(''); +} + +function pct(n) { + return `${(n * 100).toFixed(1)}%`; +} + +function msFmt(n) { + if (n === undefined || n === null) return 'n/a'; + return `${Math.round(n)}ms`; +} + +// ----- Main ----- + +async function main() { + const opts = parseArgs(); + const mementoRoot = resolve(import.meta.dirname ?? new URL('.', import.meta.url).pathname, '..'); + + assertNativeAbiMatches(mementoRoot); + + checkApiKey(opts.judgeModel, 'judge'); + if (opts.answeringModel !== opts.judgeModel) checkApiKey(opts.answeringModel, 'answering'); + + console.error('[bench] building Memento packages…'); + execSync( + 'pnpm -F @psraghuveer/memento-schema -F @psraghuveer/memento-core ' + + '-F @psraghuveer/memento-server -F @psraghuveer/memento ' + + '-F @psraghuveer/memento-embedder-local build', + { stdio: 'inherit', cwd: mementoRoot }, + ); + const mementoBin = resolve(mementoRoot, 'packages/cli/dist/cli.js'); + if (!existsSync(mementoBin)) { + throw new Error(`built CLI not found at ${mementoBin}; build may have failed`); + } + + // Anchor default `--out` to the Memento repo root, not `process.cwd()`, + // so callers who `cd` into the fork before running don't get the bench + // output directory created inside the fork worktree. + if (opts.outDir === null) { + opts.outDir = resolve(mementoRoot, 'bench'); + } else { + opts.outDir = resolve(opts.outDir); + } + + // Stage memorybench. Either use a local checkout (preferred for + // development) or clone the fork at the pinned ref into a tmp dir. + let workdir; + let workdirOwned = false; + if (opts.memorybenchDir) { + workdir = resolve(opts.memorybenchDir); + if (!existsSync(workdir) || !statSync(workdir).isDirectory()) { + throw new Error(`--memorybench-dir not a directory: ${workdir}`); + } + console.error(`[bench] using local memorybench checkout: ${workdir}`); + } else { + workdir = mkdtempSync(join(tmpdir(), 'memento-bench-')); + workdirOwned = true; + console.error( + `[bench] cloning ${opts.memorybenchRepo} @ ${opts.memorybenchRef} into ${workdir}…`, + ); + execSync( + `git clone --depth 1 --branch ${opts.memorybenchRef} ${opts.memorybenchRepo} ${workdir}`, + { stdio: 'inherit' }, + ); + } + + // Always ensure deps are installed; bun install is idempotent and + // bun.lock guarantees stability across runs. + console.error('[bench] bun install (memorybench)…'); + execSync('bun install', { cwd: workdir, stdio: 'inherit' }); + + mkdirSync(opts.outDir, { recursive: true }); + // memorybench's `.gitignore` excludes `/data/`; keep our SQLite + // files there too so they don't pollute the worktree with + // untracked WAL artifacts the user has to ignore by hand. + const dbDir = join(workdir, 'data', 'memento-bench'); + mkdirSync(dbDir, { recursive: true }); + + // Build the run list. New invocation: one runId per requested + // benchmark, derived from the current timestamp. Resume invocation: + // use the caller-supplied runIds verbatim, derive the benchmark + // from each. + const resuming = opts.resumeRunIds.length > 0; + const targets = resuming + ? opts.resumeRunIds.map((runId) => { + const parsed = parseRunId(runId); + if (!parsed) { + throw new Error( + `--resume runId '${runId}' does not match expected shape memento--`, + ); + } + return { benchmark: parsed.benchmark, runId }; + }) + : opts.benchmarks.map((bench) => ({ + benchmark: bench, + runId: `memento-${bench}-${opts.ts}`, + })); + + const reports = []; + for (const { benchmark: bench, runId } of targets) { + const dbPath = join(dbDir, `${runId}.db`); + if (resuming) { + const cpPath = join(workdir, 'data', 'runs', runId, 'checkpoint.json'); + if (!existsSync(cpPath)) { + throw new Error( + `--resume: no checkpoint for ${runId} at ${cpPath} (was the original run staged against a different --memorybench-dir?)`, + ); + } + console.error(`[bench] RESUMING ${bench} (run=${runId})`); + } else { + console.error(`[bench] running ${bench} (run=${runId}, db=${dbPath})…`); + } + // Log the runId on its own line in a grep-friendly shape so it's + // recoverable from the bench log if the process dies before + // memorybench prints the failure summary. + console.error(`[bench] runId: ${runId}`); + + const env = { + ...process.env, + // Use process.execPath (the exact Node binary running bench.mjs) + // rather than the literal string 'node', so a `nvm` + `homebrew` + // PATH cocktail can't pick a Node whose better-sqlite3 ABI + // doesn't match the build cache. + MEMENTO_BIN: `${process.execPath} ${mementoBin}`, + MEMENTO_BENCH_DB: dbPath, + // The Memento provider uses an LLM for session-level + // distillation. Default to the answering model so cost / quality + // are consistent across one run; overridable in the parent env. + MEMENTO_DISTILL_MODEL: process.env.MEMENTO_DISTILL_MODEL ?? opts.answeringModel, + }; + if (opts.searchLimit) env.MEMENTO_BENCH_SEARCH_LIMIT = String(opts.searchLimit); + // Resume mode only needs `-r `. The checkpoint already + // carries benchmark, sampling, judge, and answering model; passing + // them again would either be redundant or (worse) silently + // disagree with what's stored. + const argv = resuming + ? ['run', 'src/index.ts', 'run', '-r', runId] + : [ + 'run', + 'src/index.ts', + 'run', + '-p', + 'memento', + '-b', + bench, + '-r', + runId, + '-j', + opts.judgeModel, + '-m', + opts.answeringModel, + ]; + if (!resuming && opts.limit) argv.push('-l', String(opts.limit)); + // `-s N` (memorybench's per-category sample size) spreads picks + // across all categories; pair with `--sample-type=random` to + // also spread across conversations within each category. Mutually + // exclusive with `--limit` on the memorybench side — if both are + // set memorybench prefers `sample`. + if (!resuming && opts.sample) argv.push('-s', String(opts.sample)); + if (!resuming && opts.sampleType) argv.push('--sample-type', opts.sampleType); + // Per-phase concurrency passthrough. `--concurrency` sets the + // default; the per-phase flags override it for that phase. Useful + // for taming the embedder during ingest on smaller machines. + if (opts.concurrency) argv.push('--concurrency', String(opts.concurrency)); + if (opts.concurrencyIngest) argv.push('--concurrency-ingest', String(opts.concurrencyIngest)); + if (opts.concurrencyIndexing) + argv.push('--concurrency-indexing', String(opts.concurrencyIndexing)); + if (opts.concurrencySearch) argv.push('--concurrency-search', String(opts.concurrencySearch)); + if (opts.concurrencyAnswer) argv.push('--concurrency-answer', String(opts.concurrencyAnswer)); + if (opts.concurrencyEvaluate) + argv.push('--concurrency-evaluate', String(opts.concurrencyEvaluate)); + + try { + await spawnAwait('bun', argv, { cwd: workdir, env, stdio: 'inherit' }); + } catch (err) { + printResumeHint(runId, opts); + throw err; + } + + const reportPath = join(workdir, 'data', 'runs', runId, 'report.json'); + if (!existsSync(reportPath)) { + printResumeHint(runId, opts); + throw new Error(`expected report not found at ${reportPath}`); + } + const report = JSON.parse(readFileSync(reportPath, 'utf8')); + reports.push({ benchmark: bench, runId, report, reportPath, dbPath }); + } + + // Render summary markdown. One file per invocation, named after the + // wall-clock timestamp; the directory is the user's `--out` (default: + // `/bench/`). Resumed runs land beside the original + // because parseArgs reused the runId's timestamp. + const memento = gitInfo(mementoRoot); + const fork = gitInfo(workdir); + const summaryPath = join(opts.outDir, `${opts.ts}.md`); + const summary = renderSummary({ + reports, + opts, + memento, + fork, + mementoBin, + }); + await writeFile(summaryPath, summary, 'utf8'); + + console.error('[bench] done.'); + console.error(`[bench] summary: ${summaryPath}`); + for (const r of reports) { + console.error(`[bench] ${r.benchmark}: ${r.reportPath}`); + } + + if (workdirOwned && process.env.MEMENTO_BENCH_KEEP_WORKDIR !== '1') { + console.error(`[bench] cleaning up ${workdir} (set MEMENTO_BENCH_KEEP_WORKDIR=1 to keep)`); + rmSync(workdir, { recursive: true, force: true }); + } +} + +function renderSummary({ reports, opts, memento, fork, mementoBin }) { + const lines = []; + lines.push('# Memento × memorybench — baseline run'); + lines.push(''); + lines.push(`Run at \`${opts.ts}\`.`); + lines.push(''); + lines.push('## Results'); + lines.push(''); + lines.push('| Benchmark | Total | Correct | Accuracy | MemScore | p50 search | p95 search |'); + lines.push('|---|---|---|---|---|---|---|'); + for (const r of reports) { + const s = r.report.summary ?? {}; + const lat = r.report.latency?.search ?? {}; + const ms = r.report.memscore ?? 'n/a'; + lines.push( + `| ${r.benchmark} | ${s.totalQuestions ?? '?'} | ${s.correctCount ?? '?'} | ` + + `${s.accuracy !== undefined ? pct(s.accuracy) : '?'} | \`${ms}\` | ` + + `${msFmt(lat.median)} | ${msFmt(lat.p95)} |`, + ); + } + lines.push(''); + lines.push('## Per-question-type breakdown'); + lines.push(''); + for (const r of reports) { + lines.push(`### ${r.benchmark}`); + lines.push(''); + const byType = r.report.byQuestionType ?? {}; + const types = Object.keys(byType); + if (types.length === 0) { + lines.push('*(no per-type stats reported)*'); + } else { + lines.push('| Type | Total | Correct | Accuracy |'); + lines.push('|---|---|---|---|'); + for (const t of types) { + const v = byType[t]; + lines.push( + `| ${t} | ${v.total} | ${v.correct} | ${v.accuracy !== undefined ? pct(v.accuracy) : '?'} |`, + ); + } + } + lines.push(''); + } + lines.push('## Reproducibility'); + lines.push(''); + lines.push('```'); + lines.push(`memento branch ${memento.branch}`); + lines.push(`memento sha ${memento.shaShort}${memento.dirty ? ' (dirty)' : ''}`); + lines.push(`memento bin ${mementoBin}`); + lines.push(`memorybench branch ${fork.branch}`); + lines.push(`memorybench sha ${fork.shaShort}${fork.dirty ? ' (dirty)' : ''}`); + lines.push(`memorybench repo ${opts.memorybenchRepo}`); + lines.push(`benchmarks ${opts.benchmarks.join(', ')}`); + lines.push(`judge model ${opts.judgeModel}`); + lines.push(`answering model ${opts.answeringModel}`); + if (opts.searchLimit !== undefined) lines.push(`search limit ${opts.searchLimit}`); + if (opts.limit !== undefined) lines.push(`limit ${opts.limit}`); + if (opts.concurrencyEvaluate !== undefined) + lines.push(`concurrency-evaluate ${opts.concurrencyEvaluate}`); + lines.push('```'); + lines.push(''); + lines.push('### Per-run reports'); + lines.push(''); + for (const r of reports) { + lines.push(`- \`${r.benchmark}\`: ${r.reportPath}`); + } + lines.push(''); + return lines.join('\n'); +} + +main().catch((e) => { + console.error(`[bench] fatal: ${e?.stack ?? e?.message ?? e}`); + process.exit(1); +}); diff --git a/skills/memento/SKILL.md b/skills/memento/SKILL.md index 50cdc56..a1b0225 100644 --- a/skills/memento/SKILL.md +++ b/skills/memento/SKILL.md @@ -53,12 +53,50 @@ If you loaded a memory speculatively and did not end up using it, do not confirm ### 3. Before you wrap up — extract what surfaced -Call `extract_memory` with a batch of candidates for anything durable that came up in the conversation but was not explicitly written. The server dedups against existing memories using embedding similarity, scrubs secrets, and writes the survivors with lower default confidence (so unconfirmed extractions decay faster than direct user statements). +Call `extract_memory` with a batch of candidates for anything durable that came up in the conversation but was not explicitly written. The server dedups against existing memories using embedding similarity, scrubs secrets, and writes the survivors with lower default confidence (`storedConfidence: 0.8` vs. `write_memory`'s `1.0`, so unconfirmed extractions decay faster than direct user statements). When in doubt, include the candidate. The server is the gatekeeper, not you. +**The candidate shape is different from `write_memory`.** `extract_memory` uses a flat candidate: `kind` is a plain string, `rationale` and `language` are top-level fields. `write_memory` uses a discriminated-union object (`kind: {type: "decision", rationale: "..."}`). Copying the write shape into an extract candidate produces `INVALID_INPUT` and rejects the entire batch, not just the offending item. + +```json +{ + "candidates": [ + {"kind": "preference", "content": "node-package-manager: pnpm\n\nRaghu prefers pnpm over npm for Node projects."}, + {"kind": "fact", "content": "The staging cluster lives at gke-staging."}, + {"kind": "decision", + "content": "storage-engine: SQLite\n\nChosen for the single-file local-first story.", + "rationale": "No daemon, FTS5 built in, prebuilt binaries on every common platform."}, + {"kind": "snippet", "content": "memento read ", "language": "shell"} + ] +} +``` + +Same `topic: value\n\nprose` rule applies to `preference` and `decision` candidates inside `extract_memory` as for `write_memory`. The conflict detector parses the first line; an offending candidate fails validation for the whole batch. + `extract_memory` returns a `mode` field: `'sync'` means the response arrays are authoritative (you can tell the user "I saved 3 things and skipped 1 duplicate" directly); `'async'` (the default per `extraction.processing` config) means the server accepted the batch and is processing in background — the response will look empty (`written: [], skipped: [], superseded: []`) but a `hint` field tells you what to do next, and the work lands as memories within ~1–5 seconds. Don't retry on an async response — it's a fire-and-forget receipt, not an error. +### Distillation craft: preserve specifics, bias toward inclusion + +"Distilled, not transcript" does not mean "summarised into generic categories." You are not summarising the conversation for a reader — you are producing **retrieval candidates for unknown future queries**. The right mental frame is "index every concrete reference," not "capture the gist." The future question may ask about any specific date, named entity, proper noun, action, or object that came up — including ones that feel incidental at write time. + +Six rules guard against the failure modes that make distilled memory unusable: + +1. **Preserve specific words.** Use the speakers' exact terms for proper nouns, named entities, identity qualifiers, places, dates, and the specific object of any action. The future question will use the specific term — a paraphrase makes the memory unfindable. + - The user mentions "**adoption agencies**" → "Raghu researched adoption agencies." Not "Raghu researched career options." + - The user describes themselves as a "**transgender** woman" → "Raghu is a transgender woman." Not "Raghu identifies as a woman." + - The user mentions "the **Wonderland Trail**" → name it. Not "a hiking trail." + - The user mentions "**May 7**" → resolve to an absolute date and emit it. Not "in spring." +2. **Capture facts about every named participant, not only the user.** A conversation may mention or include other people — a friend the user talks about, a colleague, a family member, or a co-speaker in a shared session. Facts those named people share about themselves AND the user's specific observations about them are both worth indexing, each attributed to the right person. The future question may ask about anyone named in the conversation, not just the primary user. + - "My friend Alex is moving to Berlin next month for a SAP job" → emit "Alex is moving to Berlin in " AND "Alex has a new job at SAP" (attributed to Alex, not to Raghu). + - In a meeting transcript where Sarah said "I have three kids" and the user said "I work from home" → both facts get captured, each attributed to its speaker. Don't bias toward the first speaker, the more talkative one, or the apparent "user" persona. +3. **Emit a candidate for every dated event — and resolve relative times.** If a message refers to an event with a resolvable date (whether absolute like "May 7" or relative like "yesterday" / "last Tuesday" / "two weeks ago" / "this morning"), emit a candidate for that event with the absolute date in the content. Resolve relative dates against the current date. "The user said yesterday they went to the conference" on 2026-05-08 → "On 2026-05-07, the user attended the conference." Do **not** generalise dated events into untimed habits — "the user attends conferences" loses the date and breaks future temporal queries. When in doubt, emit both: one timed-event candidate AND one general assertion candidate. The future "when did X happen?" question can only be answered by a memory that names the date. +4. **Capture precursor actions alongside outcomes.** When the user describes a sequence ("researched X then chose Y", "tried A and settled on B", "considered and picked "), emit a candidate for the precursor (the research, the try, the consideration) AND a candidate for the outcome. Future questions can target either step — "what did Raghu research?" and "what did Raghu choose?" have different answers and need different candidates. The outcome never erases the precursor. +5. **Don't squash enumerations.** If the user lists four activities (hiking, biking, swimming, pottery), emit four facts — or one fact that names all four explicitly — never one fact that says "outdoor activities and crafts." The benchmark for "did you capture this" is: can a later question that asks about exactly one of the four still find it? +6. **Bias toward inclusion.** When in doubt, emit the candidate. The server dedups via embedding similarity, so two near-equivalent candidates collapse to one row — the cost of over-including is low, the cost of under-including is that the fact is gone. Better 20 precise candidates than 5 broad ones. + +Before you finalise a `write_memory` or `extract_memory` call, do one pass over the conversation and check: does every (a) date or time-relative word, (b) proper noun / named entity, (c) action verb with a specific object map to at least one candidate? If a reference is missing, add it. These rules apply equally to direct `write_memory` calls and to `extract_memory` batches — the same paraphrase-loss can happen anywhere an LLM is mediating between conversation and memory. + ## What to write — and what not to **Write** durable assertions about the user, their preferences, their tools, their conventions, their projects, or their decisions. For `preference` and `decision` memories, **start the content with a `topic: value` line** before any prose — this single line is what conflict detection parses, so without it two contradictory preferences (e.g. "I use bun" vs "I use npm") will silently coexist instead of being surfaced for triage. Free prose can follow on subsequent lines for retrieval and human readability. @@ -181,10 +219,12 @@ When the four most-touched judgement calls come up, fall back to these one-line | Situation | Tool | Why | | --- | --- | --- | -| User explicitly states one durable thing ("remember X"). | `write_memory` | One round-trip. Explicit attribution. | +| User explicitly states one durable thing ("remember X"). | `write_memory` | One round-trip. Explicit attribution. Synchronous — the response is the receipt. | | User explicitly states several durable things in one breath ("remember A, B, and C"). | N × `write_memory` (sequential) | Each is independently true; one failing shouldn't roll the others back. Prefer this over `write_many_memories` unless you actually need atomicity. | -| End-of-session sweep — things the user mentioned in passing but didn't say "remember". | `extract_memory` | Server dedups, scrubs, lowers confidence (0.8). Async by default — fire and forget. | -| Bulk-loading from a paste / doc / migration where atomicity matters. | `write_many_memories` | Programmatic surface — rare in normal AI use; reach for it only when you genuinely need "all-or-nothing" semantics. | +| End-of-session sweep — things the user mentioned in passing but didn't say "remember". | `extract_memory` | Server dedups, scrubs, lowers confidence (0.8). Async by default — the response arrays will be empty by design; the work lands within ~1–5 s. **Candidate shape is flat (`kind: "fact"`), unlike write_memory's nested kind object.** | +| Bulk-loading from a paste / doc / migration where atomicity matters. | `write_many_memories` | Programmatic surface — rare in normal AI use; reach for it only when you genuinely need "all-or-nothing" semantics. Same nested `kind` shape as `write_memory`. | + +**Common confusion**: `write_memory` and `extract_memory` accept different candidate shapes for the same conceptual fields. Write uses a discriminated-union `kind`; extract uses a flat `kind` string with `rationale` / `language` at the top level. The tool descriptions in `tools/list` spell this out — if you're unsure, check them before composing the payload. ### Which kind?