fix(harbor): instruction advertises the free baseline eval, gated on sidecar support by shehabyasser-scale · Pull Request #26 · scaleapi/vero

shehabyasser-scale · 2026-07-03T18:23:10Z

Order-safe companion to #25. The free-baseline bullet renders only when the vero tree being compiled actually grants the free eval (introspected from StatusSummary), so this PR merges safely in any order relative to #25; the bullet activates once both chains are in.

Problem (two live findings, same root cause)

The compiled instruction under-informs the optimizer about budget economics:

The instruction contradicts feat(harbor): the agent's first baseline eval is budget-free #25. The auto_best paragraph (from docs(harbor): warn in the task instruction that baseline evals create no candidate #12) says baseline evals "spend budget without creating a candidate". Once feat(harbor): the agent's first baseline eval is budget-free #25 grants the first baseline eval budget-free, that wording actively steers the optimizer away from the free reference measurement. Observed live (GAIA weak-model run, 2026-07-03): the optimizer never claimed its free baseline (free_baseline_available: true the whole run) and produced three monotonically regressing candidates (0.25 -> 0.194 -> 0.167 on train) without ever learning that all of them were underwater vs the 0.317 baseline.
Optimizers quit with budget unspent. Two live runs ended with roughly half the eval budget unused (3 of 5; 3 of 7, plus 4 of 7 on the run above). Nothing tells the agent that unspent evals are wasted, or that re-measuring a noisy best candidate is a legitimate spend.

Fix

Reword the auto_best paragraph: "baseline evals create no candidate" (true both with and without feat(harbor): the agent's first baseline eval is budget-free #25) replaces the metering claim that feat(harbor): the agent's first baseline eval is budget-free #25 falsifies.
New Rules bullet (rendered only when the tree grants it): first baseline eval is budget-free, once per task not per split, take it on the selection split before the first candidate eval, stays free after commits (--commit <baseline-sha>), repeats are metered.
New Rules bullet (unconditional): scores are noisy; unspent budget is wasted; re-measure your best candidate or try one more variant rather than stopping early.
Compiler passes free_baseline into the template context by introspecting StatusSummary for free_baseline_available, so the instruction can never promise an unmetered eval the sidecar would meter (acting on that promise burns a metered eval on a commit auto_best cannot select; fatal on a total_run_budget: 1 task).

Tests

Two-armed, honest on every chain: test_instruction_advertises_free_baseline_eval (skips where the tree lacks #25), test_instruction_omits_free_baseline_claim_when_unsupported (the merge-order guard; skips where the tree has #25), test_instruction_tells_agent_to_spend_whole_budget (unconditional), and the submit-mode boundary test now pins the mode-agnostic bullets with == _HAS_FREE_BASELINE. Verified green on both arms: this branch (12 passed, 1 skipped) and the #25-merged integration tree (21 passed, 1 skipped); the tests were also verified to fail against the old template.

Review

Pre-open adversarial panel (3 lenses + blocker verification): 2 BLOCK verdicts, both the same verified merge-order hazard (an earlier draft rendered the bullet unconditionally); resolved by the conditional render + guard test above. Wording nits applied: once-per-task-not-per-split, freebie aimed at the selection split, removed a "(before any commits)" parenthetical that read as a validity condition.

Stacked on #12 (harbor-3-compiler-instruction-warning), whose warning text this updates.

🤖 Generated with Claude Code

Greptile Summary

This PR fixes the Harbor optimizer instruction so it accurately describes baseline-eval economics. It rewrites the auto_best "spends budget without creating a candidate" paragraph (corrected to "create no candidate"), adds a conditionally-rendered free-baseline bullet (only when the current tree's StatusSummary includes the free_baseline_available field from PR #25), and adds an unconditional "unspent budget is wasted" bullet to stop optimizers from quitting early.

Compiler (compiler.py): Introspects StatusSummary at build time via dataclasses.fields() and passes free_baseline (bool) into the Jinja2 template context; the coupling is named via _FREE_BASELINE_FIELD.
Template (instruction.md.j2): Two new rule bullets — the free-baseline bullet is gated on {% if free_baseline %} and the spend-the-budget bullet is unconditional; both appear regardless of submit_enabled.
Tests (test_harbor_build.py): Two-armed skipif coverage for the free-baseline bullet, a new unconditional spend-budget test, and an updated submit-mode boundary test — all tied to a module-level _HAS_FREE_BASELINE flag derived from the same StatusSummary introspection.

Confidence Score: 5/5

Safe to merge; the conditional render correctly prevents the instruction from advertising a free eval the sidecar does not yet grant.

The merge-order hazard was caught pre-open and resolved by introspecting StatusSummary at compile time. Both test arms are covered with skipif guards, the template whitespace is correct, and the unconditional spend-the-budget bullet is straightforward.

No files require special attention; the string-literal duplication in test_harbor_build.py is a maintenance nit that does not affect correctness on either merge-order branch today.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/build/compiler.py	Adds `_FREE_BASELINE_FIELD` constant and `free_baseline` template-context entry computed via `dataclasses.fields(StatusSummary)`; logic is correct and merge-order safe.
vero/src/vero/harbor/build/templates/instruction.md.j2	Adds conditional free-baseline bullet and unconditional spend-the-budget bullet; template whitespace with `trim_blocks`/`lstrip_blocks` is correct.
vero/tests/test_harbor_build.py	Solid two-armed skipif coverage; `_HAS_FREE_BASELINE` duplicates the `"free_baseline_available"` string literal from the compiler rather than importing `_FREE_BASELINE_FIELD`, creating a silent divergence risk if the constant is ever renamed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["compile_task()"] --> B["dataclasses.fields(StatusSummary)"]
    B --> C{"'free_baseline_available'\nin field names?"}
    C -- "Yes (PR #25 merged)" --> D["free_baseline = True"]
    C -- "No (PR #25 absent)" --> E["free_baseline = False"]
    D --> F["Render instruction.md.j2\nwith free-baseline bullet"]
    E --> G["Render instruction.md.j2\nwithout free-baseline bullet"]
    F --> H["Agent sees: budget-free first baseline eval + unspent budget is wasted"]
    G --> I["Agent sees: unspent budget is wasted (no free-eval promise)"]

    style D fill:#c8e6c9
    style E fill:#fff9c4
    style H fill:#c8e6c9
    style I fill:#fff9c4

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A["compile_task()"] --> B["dataclasses.fields(StatusSummary)"]
    B --> C{"'free_baseline_available'\nin field names?"}
    C -- "Yes (PR #25 merged)" --> D["free_baseline = True"]
    C -- "No (PR #25 absent)" --> E["free_baseline = False"]
    D --> F["Render instruction.md.j2\nwith free-baseline bullet"]
    E --> G["Render instruction.md.j2\nwithout free-baseline bullet"]
    F --> H["Agent sees: budget-free first baseline eval + unspent budget is wasted"]
    G --> I["Agent sees: unspent budget is wasted (no free-eval promise)"]

    style D fill:#c8e6c9
    style E fill:#fff9c4
    style H fill:#c8e6c9
    style I fill:#fff9c4

_{Reviews (2): Last reviewed commit: "refactor(harbor): name the free-baseline..." | Re-trigger Greptile}

…the sidecar actually granting it Two live-run findings, same root cause (the compiled instruction under-informs the optimizer about budget economics): 1. The auto_best paragraph claimed baseline evals 'spend budget without creating a candidate'. Once the sidecar grants the first baseline eval budget-free, that wording actively steers the optimizer away from the free reference measurement. Observed live: an optimizer produced three monotonically regressing candidates and never learned where zero was. 2. Two optimizers finished with nearly half their eval budget unspent. Nothing told them unspent evals are wasted or that re-measuring a noisy best candidate is a legitimate spend. The free-baseline bullet renders only when the vero tree being compiled actually has the feature (introspected from StatusSummary): the compiler and the free-baseline eval live on different PR chains, and an instruction that promises an unmetered eval the sidecar meters would send the agent to burn budget on a commit auto_best cannot select. Tests carry both arms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… constant The compiler introspects StatusSummary for the free-baseline field name; that string was an unnamed literal, the sole coupling to the field PR #25 adds on a separate chain. Hoist it to _FREE_BASELINE_FIELD so the compiler<->protocol contract has one documented source. A hard import-time assert cannot live here: the field is legitimately absent on this branch's base until the sidecar PR merges, which is why the render is introspection-gated in the first place. Addresses Greptile P2 on this PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale mentioned this pull request Jul 3, 2026

feat(harbor): the agent's first baseline eval is budget-free #25

Open

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread vero/src/vero/harbor/build/compiler.py Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(harbor): instruction advertises the free baseline eval, gated on sidecar support#26

fix(harbor): instruction advertises the free baseline eval, gated on sidecar support#26
shehabyasser-scale wants to merge 2 commits into
harbor-3-compiler-instruction-warningfrom
harbor-3-instruction-free-baseline

shehabyasser-scale commented Jul 3, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

shehabyasser-scale commented Jul 3, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem (two live findings, same root cause)

Fix

Tests

Review

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shehabyasser-scale commented Jul 3, 2026 •

edited by greptile-apps Bot

Loading