fix(harbor): auto_best reverts to the baseline when no candidate beats it #28
Open
shehabyasser-scale wants to merge 1 commit into
…s it
auto_best excludes base_commit from the candidate pool, so when every candidate regressed it still selected the least-bad one and shipped a regression (observed live: an opus optimizer on a weak haiku inner model produced only below-baseline candidates, and finalize shipped one 0.10 below the baseline, even though the free baseline reference was available). Visibility alone did not prevent the harm; nothing acted on it.
Add a selection floor: after the admin re-score picks the best candidate, admin-score the untouched base_commit on the selection split and revert to it when the best candidate does not strictly beat it (a statistical tie reverts too: if the optimizer cannot show an improvement, shipping the seed is the safe outcome). On by default, gated on a base_commit being set; costs one extra admin eval.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem (observed live, twice)
auto_best excludes base_commit from the candidate pool, so when every candidate regressed it still selected the least-bad one and shipped a regression. Observed in two independent weak-inner-model trials: candidates 0.25/0.194/0.167 shipped 0.267 vs a 0.367 baseline; then, with the free-baseline reference available and actually claimed by the optimizer, candidates 0.25/0.222/0.167 still shipped 0.167 vs a 0.267 baseline. Knowing where zero is did not stop the harm, because nothing acted on it: visibility without a selection floor.
Fix
A selection floor in _best_from_db: after the admin re-score picks the best candidate, admin-score the untouched base_commit on the selection split and revert to the seed when the best candidate does not strictly beat it. A statistical tie also reverts: if the optimizer cannot demonstrate an improvement, shipping the unmodified seed is the safe outcome for the customer.
auto_best_baseline_floor (default on), wired through ServeConfig; no-ops when base_commit is unset
Tests
… <= boundary against a refactor to <)
… base_commit silently no-ops (never evaluates commit=None)
Stacked on #27 (harbor-2-baseline-durable). An adversarial review panel (4 lenses + verification) ran pre-open; all confirmed findings addressed.
🤖 Generated with Claude Code
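The selection floor described under Fix can be sketched as a small pure function. This is a minimal illustration, not the code in verifier.py: `pick_with_baseline_floor`, its parameters, and the score table are hypothetical names introduced here, and the sketch assumes a non-empty shortlist and an `admin_score` callable that returns a float for a commit on the selection split.

```python
from typing import Callable, Optional, Sequence


def pick_with_baseline_floor(
    candidates: Sequence[str],
    base_commit: Optional[str],
    admin_score: Callable[[str], float],
    floor_enabled: bool = True,
) -> str:
    """Return the commit to ship (hypothetical helper, not the PR's code)."""
    # Admin re-score every shortlisted candidate and keep the best one.
    # Assumes `candidates` is non-empty; max() raises ValueError otherwise.
    scored = [(admin_score(commit), commit) for commit in candidates]
    best_score, best_commit = max(scored, key=lambda pair: pair[0])

    # The floor no-ops when disabled or when no base_commit is set,
    # so the baseline is never evaluated with commit=None.
    if not floor_enabled or base_commit is None:
        return best_commit

    # The one extra admin eval: score the untouched seed on the same split.
    base_score = admin_score(base_commit)

    # `<=` on purpose: a tie reverts to the seed as well.
    if best_score <= base_score:
        return base_commit
    return best_commit


# Illustrative scores only: every candidate is below the baseline.
scores = {"c1": 0.25, "c2": 0.222, "c3": 0.167, "base": 0.267}
shipped = pick_with_baseline_floor(["c1", "c2", "c3"], "base", scores.__getitem__)
```

With these illustrative scores the floor returns the seed; with `floor_enabled=False` the same call returns the least-bad candidate, which is the behavior the PR sets out to prevent.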
Greptile Summary
This PR adds a selection floor to
auto_bestmode: after admin re-scoring the candidate shortlist, the untouchedbase_commitis also admin-scored on the selection split, and the winner is reverted to the seed if it does not strictly beat it. The fix closes a real production path where every candidate regressed but the least-bad one was still shipped.verifier.py: Newauto_best_baseline_floorflag (defaultTrue); in_best_from_db, after picking the best candidate, the base commit is admin-evaluated on the selection split and a<=comparison triggers a revert tobase_commit. ANonebase_commit silently no-ops.serve.py:ServeConfigexposes the new flag and wires it throughbuild_components.test_harbor_verifier.py: Five new tests cover revert-on-regression, exact-tie revert, floor no-op without base_commit, candidate-beats-baseline keep, and floor-off backward compatibility; existing tests updated to setauto_best_baseline_floor=Falsewhere they isolate unrelated ranking behavior.Confidence Score: 4/5
The floor logic is sound and closes a real regression path; the one edge case (cross-dataset floor comparison when selection_dataset_id is unconfigured and the admin re-scoring reorders the shortlist) is unlikely to fire in well-configured deployments and has no effect when selection_dataset_id is set.
The core revert logic, the <= tie-break, and the base_commit=None no-op are all correct and well-tested. The dataset_id used for the base floor eval is borrowed from the top recorded-score shortlist entry rather than from the actual winner's row, which can make the comparison cross-dataset in the unconstrained-dataset scenario.
The dataset_id fallback in the floor block of vero/src/vero/harbor/verifier.py (lines 304-306) deserves a second look if selection_dataset_id is ever left unset in production configs.
Important Files Changed
… _best_from_db; one edge case in the base_dataset_id fallback can produce a cross-dataset floor comparison when selection_dataset_id is None and the admin re-scoring promotes a non-first shortlist entry.
… auto_best_baseline_floor: bool = True to ServeConfig and passes it through to Verifier; straightforward wiring, no issues.
… auto_best_baseline_floor=False to isolate their original concerns.
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant F as finalize()
    participant B as _best_from_db()
    participant E as EvaluationEngine
    F->>B: _best_from_db()
    B->>E: get_experiments_df()
    E-->>B: df
    note over B: filter to selection_split,<br/>exclude base_commit,<br/>shortlist top-K by recorded score
    loop for each shortlisted candidate
        B->>E: "evaluate_admin(commit=candidate, split=selection_split)"
        E-->>B: admin_score
    end
    note over B: sort by admin_score -> best_commit
    alt "auto_best_baseline_floor=True AND base_commit set"
        B->>E: "evaluate_admin(commit=base_commit, split=selection_split)"
        E-->>B: base_score
        alt "best_score <= base_score"
            B-->>F: return base_commit (revert to seed)
        else "best_score > base_score"
            B-->>F: return best_commit
        end
    else floor disabled or no base_commit
        B-->>F: return best_commit
    end
    loop for each VerificationTarget
        F->>E: "evaluate_admin(commit=sha, split=target.split)"
        E-->>F: target score
    end
Comments Outside Diff (1)
vero/src/vero/harbor/verifier.py, line 270-306
When selection_dataset_id is None, base_dataset_id falls back to shortlist.iloc[0].get("dataset_subset_dataset_id"), the dataset of the highest recorded-score candidate. But the admin re-scoring can promote a different shortlist entry as best_commit, and each candidate is evaluated on its own row's dataset_subset_dataset_id. When those two entries differ, the floor compares best_score (scored on the winner's dataset) against base_score (scored on the first shortlist entry's dataset), a cross-dataset comparison that can falsely keep or falsely revert. Storing the dataset_id in the rescored tuple and using the winner's own dataset_id for the base eval keeps the comparison apples-to-apples.
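The suggested remedy, carrying the dataset_id alongside each re-scored candidate, might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual change: `Rescored`, `floor_on_winner_dataset`, the `evaluate(commit, dataset_id)` callable, and the score table are all hypothetical stand-ins for whatever verifier.py really uses.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class Rescored(NamedTuple):
    """One admin re-scored shortlist entry (hypothetical shape)."""
    score: float
    commit: str
    dataset_id: Optional[str]


def floor_on_winner_dataset(
    shortlist: Sequence[Tuple[str, Optional[str]]],
    base_commit: str,
    evaluate: Callable[[str, Optional[str]], float],
) -> str:
    # Each candidate is evaluated on its own row's dataset, and that
    # dataset_id travels with the score instead of being dropped.
    rescored = [Rescored(evaluate(commit, ds), commit, ds) for commit, ds in shortlist]
    best = max(rescored, key=lambda r: r.score)

    # Score the baseline on the winner's dataset, not on shortlist[0]'s,
    # so both sides of the comparison come from the same data.
    base_score = evaluate(base_commit, best.dataset_id)
    return base_commit if best.score <= base_score else best.commit


# Illustrative table: the baseline scores differently on the two datasets.
table = {
    ("a", "d1"): 0.30,
    ("b", "d2"): 0.50,
    ("base", "d1"): 0.60,
    ("base", "d2"): 0.45,
}
chosen = floor_on_winner_dataset(
    [("a", "d1"), ("b", "d2")], "base", lambda c, ds: table[(c, ds)]
)
```

In this made-up table, scoring the baseline on the first shortlist entry's dataset (d1, 0.60) would revert a winner that actually beats the baseline on its own dataset (d2, 0.50 vs 0.45); using the winner's dataset keeps it.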
Reviews (1): Last reviewed commit: "fix(harbor): auto_best reverts to the ba..."