fix(harbor): auto_best reverts to the baseline when no candidate beats it by shehabyasser-scale · Pull Request #28 · scaleapi/vero

shehabyasser-scale · 2026-07-04T15:38:54Z

Problem (observed live, twice)

auto_best excludes base_commit from the candidate pool, so when every candidate regressed it still selected the least-bad one and shipped a regression. Observed in two independent weak-inner-model trials: candidates 0.25/0.194/0.167 shipped 0.267 vs a 0.367 baseline; then, with the free-baseline reference available and actually claimed by the optimizer, candidates 0.25/0.222/0.167 still shipped 0.167 vs a 0.267 baseline. Knowing where zero is did not stop the harm, because nothing acted on it: visibility without a selection floor.

Fix

A selection floor in _best_from_db: after the admin re-score picks the best candidate, admin-score the untouched base_commit on the selection split and revert to the seed when the best candidate does not strictly beat it. A statistical tie also reverts: if the optimizer cannot demonstrate an improvement, shipping the unmodified seed is the safe outcome for the customer.

auto_best_baseline_floor (default on), wired through ServeConfig; no-ops when base_commit is unset
Uses the trusted admin scorer for the base eval (same trust posture as the shortlist re-score); costs one extra admin eval
Logged either way (revert or keep) with both scores

Tests

reverts when the best candidate loses to the baseline: asserts the emitted reward is the seed's target-split score (not the regression) and the final eval is the target eval of the reverted commit
exact tie reverts (pins the <= boundary against a refactor to <)
candidate that beats the baseline is kept
floor off restores the old behavior (base never scored)
floor on with no base_commit silently no-ops (never evaluates commit=None)

Stacked on #27 (harbor-2-baseline-durable). An adversarial review panel (4 lenses + verification) ran pre-open; all confirmed findings addressed.

🤖 Generated with Claude Code

Greptile Summary

This PR adds a selection floor to auto_best mode: after admin re-scoring the candidate shortlist, the untouched base_commit is also admin-scored on the selection split, and the winner is reverted to the seed if it does not strictly beat it. The fix closes a real production path where every candidate regressed but the least-bad one was still shipped.

verifier.py: New auto_best_baseline_floor flag (default True); in _best_from_db, after picking the best candidate, the base commit is admin-evaluated on the selection split and a <= comparison triggers a revert to base_commit. A None base_commit silently no-ops.
serve.py: ServeConfig exposes the new flag and wires it through build_components.
test_harbor_verifier.py: Five new tests cover revert-on-regression, exact-tie revert, floor no-op without base_commit, candidate-beats-baseline keep, and floor-off backward compatibility; existing tests updated to set auto_best_baseline_floor=False where they isolate unrelated ranking behavior.

Confidence Score: 4/5

The floor logic is sound and closes a real regression path; the one edge case (cross-dataset floor comparison when selection_dataset_id is unconfigured and the admin re-scoring reorders the shortlist) is unlikely to fire in well-configured deployments and has no effect when selection_dataset_id is set.

The core revert logic, the <= tie-break, and the base_commit=None no-op are all correct and well-tested. The dataset_id used for the base floor eval is borrowed from the top recorded-score shortlist entry rather than from the actual winner's row, which can make the comparison cross-dataset in the unconstrained-dataset scenario.

The dataset_id fallback in the floor block of vero/src/vero/harbor/verifier.py (lines 304-306) deserves a second look if selection_dataset_id is ever left unset in production configs.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/verifier.py	Adds the baseline floor in `_best_from_db`; one edge case in the `base_dataset_id` fallback can produce a cross-dataset floor comparison when `selection_dataset_id` is None and the admin re-scoring promotes a non-first shortlist entry.
vero/src/vero/harbor/serve.py	Adds `auto_best_baseline_floor: bool = True` to `ServeConfig` and passes it through to `Verifier`; straightforward wiring, no issues.
vero/tests/test_harbor_verifier.py	Five well-structured new tests cover the key floor boundaries (revert, tie, no-op, keep, floor-off); existing tests correctly updated with `auto_best_baseline_floor=False` to isolate their original concerns.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant F as finalize()
    participant B as _best_from_db()
    participant E as EvaluationEngine

    F->>B: _best_from_db()
    B->>E: get_experiments_df()
    E-->>B: df
    note over B: filter to selection_split,<br/>exclude base_commit,<br/>shortlist top-K by recorded score
    loop for each shortlisted candidate
        B->>E: "evaluate_admin(commit=candidate, split=selection_split)"
        E-->>B: admin_score
    end
    note over B: sort by admin_score -> best_commit

    alt "auto_best_baseline_floor=True AND base_commit set"
        B->>E: "evaluate_admin(commit=base_commit, split=selection_split)"
        E-->>B: base_score
        alt "best_score <= base_score"
            B-->>F: return base_commit (revert to seed)
        else "best_score > base_score"
            B-->>F: return best_commit
        end
    else floor disabled or no base_commit
        B-->>F: return best_commit
    end

    loop for each VerificationTarget
        F->>E: "evaluate_admin(commit=sha, split=target.split)"
        E-->>F: target score
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant F as finalize()
    participant B as _best_from_db()
    participant E as EvaluationEngine

    F->>B: _best_from_db()
    B->>E: get_experiments_df()
    E-->>B: df
    note over B: filter to selection_split,<br/>exclude base_commit,<br/>shortlist top-K by recorded score
    loop for each shortlisted candidate
        B->>E: "evaluate_admin(commit=candidate, split=selection_split)"
        E-->>B: admin_score
    end
    note over B: sort by admin_score -> best_commit

    alt "auto_best_baseline_floor=True AND base_commit set"
        B->>E: "evaluate_admin(commit=base_commit, split=selection_split)"
        E-->>B: base_score
        alt "best_score <= base_score"
            B-->>F: return base_commit (revert to seed)
        else "best_score > base_score"
            B-->>F: return best_commit
        end
    else floor disabled or no base_commit
        B-->>F: return best_commit
    end

    loop for each VerificationTarget
        F->>E: "evaluate_admin(commit=sha, split=target.split)"
        E-->>F: target score
    end

Comments Outside Diff (1)

vero/src/vero/harbor/verifier.py, line 270-306 (link)

When selection_dataset_id is None, base_dataset_id falls back to shortlist.iloc[0].get("dataset_subset_dataset_id") — the dataset of the highest recorded-score candidate. But the admin re-scoring can promote a different shortlist entry as best_commit, and each candidate is evaluated on its own row's dataset_subset_dataset_id. When those two entries differ, the floor compares best_score (scored on the winner's dataset) against base_score (scored on the first shortlist entry's dataset) — a cross-dataset comparison that can falsely keep or falsely revert. Storing the dataset_id in the rescored tuple and using the winner's own dataset_id for the base eval keeps the comparison apples-to-apples.

Prompt To Fix With AI

This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 270-306

Comment:
When `selection_dataset_id` is `None`, `base_dataset_id` falls back to `shortlist.iloc[0].get("dataset_subset_dataset_id")` — the dataset of the highest *recorded-score* candidate. But the admin re-scoring can promote a different shortlist entry as `best_commit`, and each candidate is evaluated on its own row's `dataset_subset_dataset_id`. When those two entries differ, the floor compares `best_score` (scored on the winner's dataset) against `base_score` (scored on the first shortlist entry's dataset) — a cross-dataset comparison that can falsely keep or falsely revert. Storing the dataset_id in the `rescored` tuple and using the winner's own dataset_id for the base eval keeps the comparison apples-to-apples.



How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/harbor/verifier.py:270-306
When `selection_dataset_id` is `None`, `base_dataset_id` falls back to `shortlist.iloc[0].get("dataset_subset_dataset_id")` — the dataset of the highest *recorded-score* candidate. But the admin re-scoring can promote a different shortlist entry as `best_commit`, and each candidate is evaluated on its own row's `dataset_subset_dataset_id`. When those two entries differ, the floor compares `best_score` (scored on the winner's dataset) against `base_score` (scored on the first shortlist entry's dataset) — a cross-dataset comparison that can falsely keep or falsely revert. Storing the dataset_id in the `rescored` tuple and using the winner's own dataset_id for the base eval keeps the comparison apples-to-apples.

```suggestion
        rescored: list[tuple[float, int, str, str | None]] = []
        for idx, (_, row) in enumerate(shortlist.iterrows()):
            commit = row["candidate_commit"]
            dataset_id = row.get("dataset_subset_dataset_id")
            exp = await self.engine.evaluate_admin(
                task=self.selection_task,
                dataset_id=dataset_id,
                split=self.selection_split,
                commit=commit,
            )
            score = exp.result.score()
            admin_score = float(score) if score is not None else default_minimum_score
            # Tie-break by shortlist position (already ordered by recorded score
            # then recency), so ties resolve deterministically without depending on
            # the type of candidate_created_at (a datetime in the real DB).
            rescored.append((admin_score, idx, commit, dataset_id))
            logger.info(
                "auto_best re-score: commit=%s admin_score=%s (recorded=%s)",
                commit,
                admin_score,
                row["mean_score"],
            )
        # Highest admin score wins; ties break to the earliest shortlist position.
        rescored.sort(key=lambda t: (-t[0], t[1]))
        best_score, _, best_commit, best_dataset_id = rescored[0]

        # Selection floor: never ship a candidate that fails to beat the untouched
        # baseline on the selection split. auto_best excludes base_commit from the
        # candidate pool, so without this it selects the least-bad candidate even
        # when every candidate regressed. Revert to the seed instead. Strict '>' so
        # a statistical tie also reverts: if the optimizer cannot show an
        # improvement, shipping the seed is the safe outcome. Needs a base_commit to
        # compare against; costs one extra admin eval on the selection split.
        if self.auto_best_baseline_floor and self.base_commit is not None:
            base_dataset_id = self.selection_dataset_id or best_dataset_id
```

_{Reviews (1): Last reviewed commit: "fix(harbor): auto_best reverts to the ba..." | Re-trigger Greptile}

…s it auto_best excludes base_commit from the candidate pool, so when every candidate regressed it still selected the least-bad one and shipped a regression (observed live: an opus optimizer on a weak haiku inner model produced only below-baseline candidates and finalize shipped one 0.10 below the baseline, even though the free baseline reference was available). Visibility alone did not prevent the harm; nothing acted on it. Add a selection floor: after the admin re-score picks the best candidate, admin- score the untouched base_commit on the selection split and revert to it when the best candidate does not strictly beat it (a statistical tie reverts too: if the optimizer cannot show an improvement, shipping the seed is the safe outcome). On by default, gated on a base_commit being set; costs one extra admin eval. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(harbor): auto_best reverts to the baseline when no candidate beats it#28

fix(harbor): auto_best reverts to the baseline when no candidate beats it#28
shehabyasser-scale wants to merge 1 commit into
harbor-2-baseline-durablefrom
harbor-2-autobest-baseline-floor

shehabyasser-scale commented Jul 4, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

shehabyasser-scale commented Jul 4, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem (observed live, twice)

Fix

Tests

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shehabyasser-scale commented Jul 4, 2026 •

edited by greptile-apps Bot

Loading