fix(harbor): instruction advertises the free baseline eval, gated on sidecar support #26

Open
shehabyasser-scale wants to merge 2 commits into harbor-3-compiler-instruction-warning from harbor-3-instruction-free-baseline

shehabyasser-scale (Collaborator) commented Jul 3, 2026

Order-safe companion to #25. The free-baseline bullet renders only when the vero tree being compiled actually grants the free eval (introspected from StatusSummary), so this PR merges safely in any order relative to #25; the bullet activates once both chains are in.

Problem (two live findings, same root cause)

The compiled instruction under-informs the optimizer about budget economics:

  1. The instruction contradicts #25 (feat(harbor): the agent's first baseline eval is budget-free). The auto_best paragraph (from #12, docs(harbor): warn in the task instruction that baseline evals create no candidate) says baseline evals "spend budget without creating a candidate". Once #25 makes the first baseline eval budget-free, that wording actively steers the optimizer away from the free reference measurement. Observed live (GAIA weak-model run, 2026-07-03): the optimizer never claimed its free baseline (free_baseline_available: true for the whole run) and produced three monotonically regressing candidates (0.25 -> 0.194 -> 0.167 on train) without ever learning that all of them scored below the 0.317 baseline.
  2. Optimizers quit with budget unspent. Two live runs ended with roughly half the eval budget unused (3 of 5; 3 of 7, plus 4 of 7 on the run above). Nothing tells the agent that unspent evals are wasted, or that re-measuring a noisy best candidate is a legitimate spend.

Fix

  • Reword the auto_best paragraph: "baseline evals create no candidate" (true both with and without #25) replaces the metering claim that #25 falsifies.
  • New Rules bullet (rendered only when the tree grants it): the first baseline eval is budget-free; it is granted once per task, not per split; take it on the selection split before the first candidate eval; it stays free after commits (--commit <baseline-sha>); repeats are metered.
  • New Rules bullet (unconditional): scores are noisy; unspent budget is wasted; re-measure your best candidate or try one more variant rather than stopping early.
  • Compiler passes free_baseline into the template context by introspecting StatusSummary for free_baseline_available, so the instruction can never promise an unmetered eval the sidecar would meter (acting on that promise burns a metered eval on a commit auto_best cannot select; fatal on a total_run_budget: 1 task).
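
The introspection gate described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in compiler.py: the two StatusSummary stand-ins, the evals_remaining field, and the helper names grants_free_baseline and build_template_context are hypothetical; only _FREE_BASELINE_FIELD, free_baseline_available, free_baseline, and the use of dataclasses.fields() come from this PR.

```python
from dataclasses import dataclass, fields

# Name of the StatusSummary field that the sidecar change (#25) adds.
_FREE_BASELINE_FIELD = "free_baseline_available"


@dataclass
class StatusSummaryWithout25:
    """Hypothetical stand-in for StatusSummary on a tree that lacks #25."""
    evals_remaining: int = 0  # hypothetical field, for illustration only


@dataclass
class StatusSummaryWith25:
    """Hypothetical stand-in for StatusSummary on a tree that has #25."""
    evals_remaining: int = 0  # hypothetical field, for illustration only
    free_baseline_available: bool = False


def grants_free_baseline(summary_cls) -> bool:
    """True when the dataclass declares the free-baseline field at all.

    The check is on the field's presence, not its runtime value: presence
    means the sidecar in this tree implements the free eval.
    """
    return any(f.name == _FREE_BASELINE_FIELD for f in fields(summary_cls))


def build_template_context(summary_cls) -> dict:
    """Assumed shape of the context entry handed to the instruction template."""
    return {"free_baseline": grants_free_baseline(summary_cls)}
```

Because the gate keys on the field's presence rather than its value, the instruction cannot promise a free eval on a tree whose sidecar would meter it, whichever PR merges first.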

Tests

Two-armed, honest on every chain:

  • test_instruction_advertises_free_baseline_eval (skips where the tree lacks #25)
  • test_instruction_omits_free_baseline_claim_when_unsupported (the merge-order guard; skips where the tree has #25)
  • test_instruction_tells_agent_to_spend_whole_budget (unconditional)
  • The submit-mode boundary test now pins the mode-agnostic bullets with == _HAS_FREE_BASELINE.

Verified green on both arms: this branch (12 passed, 1 skipped) and the #25-merged integration tree (21 passed, 1 skipped). The tests were also verified to fail against the old template.

Review

Pre-open adversarial panel (3 lenses + blocker verification): 2 BLOCK verdicts, both the same verified merge-order hazard (an earlier draft rendered the bullet unconditionally); resolved by the conditional render + guard test above. Wording nits applied: once-per-task-not-per-split, freebie aimed at the selection split, removed a "(before any commits)" parenthetical that read as a validity condition.

Stacked on #12 (harbor-3-compiler-instruction-warning), whose warning text this updates.

🤖 Generated with Claude Code

Greptile Summary

This PR fixes the Harbor optimizer instruction so it accurately describes baseline-eval economics. It rewrites the auto_best "spends budget without creating a candidate" paragraph (corrected to "create no candidate"), adds a conditionally-rendered free-baseline bullet (only when the current tree's StatusSummary includes the free_baseline_available field from PR #25), and adds an unconditional "unspent budget is wasted" bullet to stop optimizers from quitting early.

  • Compiler (compiler.py): Introspects StatusSummary at build time via dataclasses.fields() and passes free_baseline (bool) into the Jinja2 template context; the coupling is named via _FREE_BASELINE_FIELD.
  • Template (instruction.md.j2): Two new rule bullets — the free-baseline bullet is gated on {% if free_baseline %} and the spend-the-budget bullet is unconditional; both appear regardless of submit_enabled.
  • Tests (test_harbor_build.py): Two-armed skipif coverage for the free-baseline bullet, a new unconditional spend-budget test, and an updated submit-mode boundary test — all tied to a module-level _HAS_FREE_BASELINE flag derived from the same StatusSummary introspection.
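
As a rough illustration of the template gating summarized above, the sketch below renders a conditional bullet with Jinja2 using trim_blocks and lstrip_blocks. It assumes the jinja2 package is installed. The template text and bullet wording are hypothetical and much shorter than instruction.md.j2; only the {% if free_baseline %} gate and the two environment options come from this PR.

```python
from jinja2 import Environment

# Hypothetical, heavily shortened stand-in for instruction.md.j2.
TEMPLATE = """\
## Rules
{% if free_baseline %}
- Your first baseline eval is budget-free.
{% endif %}
- Unspent budget is wasted.
"""

# trim_blocks drops the newline after a block tag and lstrip_blocks drops
# leading whitespace before one, so the gated bullet leaves no blank line
# behind when it is omitted.
_ENV = Environment(trim_blocks=True, lstrip_blocks=True)


def render_instruction(free_baseline: bool) -> str:
    """Render the sketch template with the compiler-supplied flag."""
    return _ENV.from_string(TEMPLATE).render(free_baseline=free_baseline)
```

Rendering with free_baseline=False yields only the heading and the spend-the-budget bullet, with no empty line where the gated bullet would have been.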

Confidence Score: 5/5

Safe to merge; the conditional render correctly prevents the instruction from advertising a free eval the sidecar does not yet grant.

The merge-order hazard was caught pre-open and resolved by introspecting StatusSummary at compile time. Both test arms are covered with skipif guards, the template whitespace is correct, and the unconditional spend-the-budget bullet is straightforward.

No files require special attention; the string-literal duplication in test_harbor_build.py is a maintenance nit that does not affect correctness on either merge-order branch today.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/build/compiler.py | Adds _FREE_BASELINE_FIELD constant and free_baseline template-context entry computed via dataclasses.fields(StatusSummary); logic is correct and merge-order safe. |
| vero/src/vero/harbor/build/templates/instruction.md.j2 | Adds conditional free-baseline bullet and unconditional spend-the-budget bullet; template whitespace with trim_blocks/lstrip_blocks is correct. |
| vero/tests/test_harbor_build.py | Solid two-armed skipif coverage; _HAS_FREE_BASELINE duplicates the "free_baseline_available" string literal from the compiler rather than importing _FREE_BASELINE_FIELD, creating a silent divergence risk if the constant is ever renamed. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["compile_task()"] --> B["dataclasses.fields(StatusSummary)"]
    B --> C{"'free_baseline_available'\nin field names?"}
    C -- "Yes (PR #25 merged)" --> D["free_baseline = True"]
    C -- "No (PR #25 absent)" --> E["free_baseline = False"]
    D --> F["Render instruction.md.j2\nwith free-baseline bullet"]
    E --> G["Render instruction.md.j2\nwithout free-baseline bullet"]
    F --> H["Agent sees: budget-free first baseline eval + unspent budget is wasted"]
    G --> I["Agent sees: unspent budget is wasted (no free-eval promise)"]

    style D fill:#c8e6c9
    style E fill:#fff9c4
    style H fill:#c8e6c9
    style I fill:#fff9c4

Reviews (2): Last reviewed commit: "refactor(harbor): name the free-baseline..."

…the sidecar actually granting it

Two live-run findings, same root cause (the compiled instruction under-informs
the optimizer about budget economics):

1. The auto_best paragraph claimed baseline evals 'spend budget without
   creating a candidate'. Once the sidecar grants the first baseline eval
   budget-free, that wording actively steers the optimizer away from the free
   reference measurement. Observed live: an optimizer produced three
   monotonically regressing candidates and never learned where zero was.
2. Two optimizers finished with nearly half their eval budget unspent. Nothing
   told them unspent evals are wasted or that re-measuring a noisy best
   candidate is a legitimate spend.

The free-baseline bullet renders only when the vero tree being compiled
actually has the feature (introspected from StatusSummary): the compiler and
the free-baseline eval live on different PR chains, and an instruction that
promises an unmetered eval the sidecar meters would send the agent to burn
budget on a commit auto_best cannot select. Tests carry both arms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread vero/src/vero/harbor/build/compiler.py Outdated
… constant

The compiler introspects StatusSummary for the free-baseline field name; that
string was an unnamed literal, the sole coupling to the field PR #25 adds on a
separate chain. Hoist it to _FREE_BASELINE_FIELD so the compiler<->protocol
contract has one documented source. A hard import-time assert cannot live here:
the field is legitimately absent on this branch's base until the sidecar PR
merges, which is why the render is introspection-gated in the first place.

Addresses Greptile P2 on this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>