spec: local eval runner & vendor-neutral adapter layer #5

Open
sunilgattupalle wants to merge 2 commits into main from feat/local-eval-runner-spec

@sunilgattupalle

Collaborator

Summary

  • Adds specs/2026-06-16-local-eval-runner-design.md — full design spec for the local eval runner, source adapters, targets, importers, config runner, and CLI
  • Establishes the layered vendor-neutral architecture: core stays frozen, vendor adapters (Langfuse, Harness, etc.) live behind optional extras
  • Primary use case: edit prompt → harness-evals run my-eval.yaml → diff scores against baseline
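The final step of that workflow, diffing scores against a baseline, can be pictured with a minimal sketch. This is an illustration only: `diff_scores`, the metric names, and the flat `{metric: score}` shape are assumptions made for the example, not the spec's actual data model or CLI output.

```python
from typing import Dict


def diff_scores(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Return (current - baseline) for every metric present in both runs.

    Hypothetical helper: the real runner may store and compare scores differently.
    """
    shared = baseline.keys() & current.keys()
    return {name: current[name] - baseline[name] for name in sorted(shared)}


# Made-up scores, chosen only to demonstrate the diff.
baseline = {"faithfulness": 0.75, "relevance": 0.5}
current = {"faithfulness": 0.5, "relevance": 0.75}

deltas = diff_scores(baseline, current)
for name, delta in deltas.items():
    # A negative delta means the edited prompt scored lower than the baseline.
    print(f"{name}: {delta:+.2f}")
```

Metrics that appear in only one of the two runs are ignored here; whether the real tool should report them is a design question this sketch does not answer.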

Key design decisions

  • Four adapter family ABCs (BaseDatasetSource, BasePromptSource, BaseEvalCaseSource, BaseEvalConfigSource) + BaseTarget — all vendor-neutral
  • Seven plugin registries including _TARGETS, _METRICS, _BASELINE_STORES so third parties can extend every layer without forking
  • ResourceRef + dual-syntax resolve() (URI shorthand or typed block)
  • MissingAdapterError at config-load time (not a cryptic ImportError at execution)
  • HttpTarget v1: bearer/api_key/basic auth; OAuth/mTLS deferred
  • datasets.py → datasets/ package migration must be atomic with back-compat re-exports
  • Golden.input serialisation contract: json.dumps for non-string inputs in PromptTarget
  • gate_against_baseline() raises BaselineRegressionError; CLI exits non-zero
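To make the dual-syntax `resolve()` and the load-time `MissingAdapterError` concrete, here is a rough sketch. Only the names `ResourceRef`, `resolve`, and `MissingAdapterError` come from the description above; the field names, the `type`/`path` keys of the typed block, the `scheme://path` shorthand, and the `installed` set are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Set, Union


class MissingAdapterError(Exception):
    """Raised at config-load time when no adapter handles a reference."""


@dataclass
class ResourceRef:
    # Field names are hypothetical; the spec may define them differently.
    scheme: str
    path: str
    options: Dict[str, Any] = field(default_factory=dict)


def resolve(spec: Union[str, Mapping[str, Any]], installed: Set[str]) -> ResourceRef:
    """Accept either a URI shorthand string or a typed mapping block."""
    if isinstance(spec, str):
        # URI shorthand, e.g. "file://data/goldens.jsonl"
        scheme, sep, path = spec.partition("://")
        if not sep:
            raise ValueError(f"not a URI shorthand: {spec!r}")
        ref = ResourceRef(scheme, path)
    else:
        # Typed block, e.g. {"type": "file", "path": "...", <extra options>}
        options = {k: v for k, v in spec.items() if k not in ("type", "path")}
        ref = ResourceRef(spec["type"], spec["path"], options)

    # Fail while the config is being loaded, not later during execution.
    if ref.scheme not in installed:
        raise MissingAdapterError(
            f"no adapter registered for {ref.scheme!r}; "
            "the matching optional extra may not be installed"
        )
    return ref


installed = {"file"}  # pretend only the built-in file adapter is available
ref = resolve("file://data/goldens.jsonl", installed)
same_ref = resolve({"type": "file", "path": "data/goldens.jsonl"}, installed)

try:
    resolve({"type": "langfuse", "path": "my-dataset"}, installed)
except MissingAdapterError as exc:
    error_message = str(exc)
```

Both syntaxes normalise to the same `ResourceRef`, so downstream code would only need to handle one shape.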

Reviewed by

  • Initial architecture review (Sonnet)
  • Independent full review (Opus) — five gaps identified and addressed in second commit

Test plan

  • Review specs/2026-06-16-local-eval-runner-design.md
  • Verify all adapter families, registries, and design decisions look correct before implementation begins

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AI-Session-Id: 5f382a59-4928-47f9-bc96-a1159c13c50e
AI-Tool: claude-code
AI-Model: unknown
- Add _TARGETS, _METRICS, _BASELINE_STORES registries to plugins.py
- Add register_target, register_metric, register_baseline_store decorators
- Add target/metric/baseline columns to adapter registry table
- Define Golden.input serialisation contract in PromptTarget (json.dumps for non-str)
- Define gate_against_baseline() contract (BaselineRegressionError, CLI exit code)
- Specify datasets.py → datasets/ migration must be atomic with back-compat re-exports
- Clarify "zero-dependency" as meaning no external account is required; LLM metrics need the [llm] extra

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AI-Session-Id: 5f382a59-4928-47f9-bc96-a1159c13c50e
AI-Tool: claude-code
AI-Model: unknown
@CLAassistant


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
