fix(harbor): instruction advertises the free baseline eval, gated on sidecar support#26
Open
shehabyasser-scale wants to merge 2 commits into
Conversation
…the sidecar actually granting it Two live-run findings, same root cause (the compiled instruction under-informs the optimizer about budget economics): 1. The auto_best paragraph claimed baseline evals 'spend budget without creating a candidate'. Once the sidecar grants the first baseline eval budget-free, that wording actively steers the optimizer away from the free reference measurement. Observed live: an optimizer produced three monotonically regressing candidates and never learned where zero was. 2. Two optimizers finished with nearly half their eval budget unspent. Nothing told them unspent evals are wasted or that re-measuring a noisy best candidate is a legitimate spend. The free-baseline bullet renders only when the vero tree being compiled actually has the feature (introspected from StatusSummary): the compiler and the free-baseline eval live on different PR chains, and an instruction that promises an unmetered eval the sidecar meters would send the agent to burn budget on a commit auto_best cannot select. Tests carry both arms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… constant The compiler introspects StatusSummary for the free-baseline field name; that string was an unnamed literal, the sole coupling to the field PR #25 adds on a separate chain. Hoist it to _FREE_BASELINE_FIELD so the compiler<->protocol contract has one documented source. A hard import-time assert cannot live here: the field is legitimately absent on this branch's base until the sidecar PR merges, which is why the render is introspection-gated in the first place. Addresses Greptile P2 on this PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Order-safe companion to #25. The free-baseline bullet renders only when the vero tree being compiled actually grants the free eval (introspected from
StatusSummary), so this PR merges safely in any order relative to #25; the bullet activates once both chains are in.Problem (two live findings, same root cause)
The compiled instruction under-informs the optimizer about budget economics:
free_baseline_available: truethe whole run) and produced three monotonically regressing candidates (0.25 -> 0.194 -> 0.167 on train) without ever learning that all of them were underwater vs the 0.317 baseline.Fix
--commit <baseline-sha>), repeats are metered.free_baselineinto the template context by introspectingStatusSummaryforfree_baseline_available, so the instruction can never promise an unmetered eval the sidecar would meter (acting on that promise burns a metered eval on a commit auto_best cannot select; fatal on atotal_run_budget: 1task).Tests
Two-armed, honest on every chain:
test_instruction_advertises_free_baseline_eval(skips where the tree lacks #25),test_instruction_omits_free_baseline_claim_when_unsupported(the merge-order guard; skips where the tree has #25),test_instruction_tells_agent_to_spend_whole_budget(unconditional), and the submit-mode boundary test now pins the mode-agnostic bullets with== _HAS_FREE_BASELINE. Verified green on both arms: this branch (12 passed, 1 skipped) and the #25-merged integration tree (21 passed, 1 skipped); the tests were also verified to fail against the old template.Review
Pre-open adversarial panel (3 lenses + blocker verification): 2 BLOCK verdicts, both the same verified merge-order hazard (an earlier draft rendered the bullet unconditionally); resolved by the conditional render + guard test above. Wording nits applied: once-per-task-not-per-split, freebie aimed at the selection split, removed a "(before any commits)" parenthetical that read as a validity condition.
Stacked on #12 (
harbor-3-compiler-instruction-warning), whose warning text this updates.🤖 Generated with Claude Code
Greptile Summary
This PR fixes the Harbor optimizer instruction so it accurately describes baseline-eval economics. It rewrites the auto_best "spends budget without creating a candidate" paragraph (corrected to "create no candidate"), adds a conditionally-rendered free-baseline bullet (only when the current tree's
StatusSummaryincludes thefree_baseline_availablefield from PR #25), and adds an unconditional "unspent budget is wasted" bullet to stop optimizers from quitting early.compiler.py): IntrospectsStatusSummaryat build time viadataclasses.fields()and passesfree_baseline(bool) into the Jinja2 template context; the coupling is named via_FREE_BASELINE_FIELD.instruction.md.j2): Two new rule bullets — the free-baseline bullet is gated on{% if free_baseline %}and the spend-the-budget bullet is unconditional; both appear regardless ofsubmit_enabled.test_harbor_build.py): Two-armed skipif coverage for the free-baseline bullet, a new unconditional spend-budget test, and an updated submit-mode boundary test — all tied to a module-level_HAS_FREE_BASELINEflag derived from the sameStatusSummaryintrospection.Confidence Score: 5/5
Safe to merge; the conditional render correctly prevents the instruction from advertising a free eval the sidecar does not yet grant.
The merge-order hazard was caught pre-open and resolved by introspecting
StatusSummaryat compile time. Both test arms are covered with skipif guards, the template whitespace is correct, and the unconditional spend-the-budget bullet is straightforward.No files require special attention; the string-literal duplication in
test_harbor_build.pyis a maintenance nit that does not affect correctness on either merge-order branch today.Important Files Changed
_FREE_BASELINE_FIELDconstant andfree_baselinetemplate-context entry computed viadataclasses.fields(StatusSummary); logic is correct and merge-order safe.trim_blocks/lstrip_blocksis correct._HAS_FREE_BASELINEduplicates the"free_baseline_available"string literal from the compiler rather than importing_FREE_BASELINE_FIELD, creating a silent divergence risk if the constant is ever renamed.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["compile_task()"] --> B["dataclasses.fields(StatusSummary)"] B --> C{"'free_baseline_available'\nin field names?"} C -- "Yes (PR #25 merged)" --> D["free_baseline = True"] C -- "No (PR #25 absent)" --> E["free_baseline = False"] D --> F["Render instruction.md.j2\nwith free-baseline bullet"] E --> G["Render instruction.md.j2\nwithout free-baseline bullet"] F --> H["Agent sees: budget-free first baseline eval + unspent budget is wasted"] G --> I["Agent sees: unspent budget is wasted (no free-eval promise)"] style D fill:#c8e6c9 style E fill:#fff9c4 style H fill:#c8e6c9 style I fill:#fff9c4%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A["compile_task()"] --> B["dataclasses.fields(StatusSummary)"] B --> C{"'free_baseline_available'\nin field names?"} C -- "Yes (PR #25 merged)" --> D["free_baseline = True"] C -- "No (PR #25 absent)" --> E["free_baseline = False"] D --> F["Render instruction.md.j2\nwith free-baseline bullet"] E --> G["Render instruction.md.j2\nwithout free-baseline bullet"] F --> H["Agent sees: budget-free first baseline eval + unspent budget is wasted"] G --> I["Agent sees: unspent budget is wasted (no free-eval promise)"] style D fill:#c8e6c9 style E fill:#fff9c4 style H fill:#c8e6c9 style I fill:#fff9c4Reviews (2): Last reviewed commit: "refactor(harbor): name the free-baseline..." | Re-trigger Greptile