Skip to content

fix(harbor): auto_best reverts to the baseline when no candidate beats it#28

Open
shehabyasser-scale wants to merge 1 commit into
harbor-2-baseline-durablefrom
harbor-2-autobest-baseline-floor
Open

fix(harbor): auto_best reverts to the baseline when no candidate beats it#28
shehabyasser-scale wants to merge 1 commit into
harbor-2-baseline-durablefrom
harbor-2-autobest-baseline-floor

Conversation

@shehabyasser-scale

@shehabyasser-scale shehabyasser-scale commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator

Problem (observed live, twice)

auto_best excludes base_commit from the candidate pool, so when every candidate regressed it still selected the least-bad one and shipped a regression. Observed in two independent weak-inner-model trials: candidates 0.25/0.194/0.167 shipped 0.267 vs a 0.367 baseline; then, with the free-baseline reference available and actually claimed by the optimizer, candidates 0.25/0.222/0.167 still shipped 0.167 vs a 0.267 baseline. Knowing where zero is did not stop the harm, because nothing acted on it: visibility without a selection floor.

Fix

A selection floor in _best_from_db: after the admin re-score picks the best candidate, admin-score the untouched base_commit on the selection split and revert to the seed when the best candidate does not strictly beat it. A statistical tie also reverts: if the optimizer cannot demonstrate an improvement, shipping the unmodified seed is the safe outcome for the customer.

  • auto_best_baseline_floor (default on), wired through ServeConfig; no-ops when base_commit is unset
  • Uses the trusted admin scorer for the base eval (same trust posture as the shortlist re-score); costs one extra admin eval
  • Logged either way (revert or keep) with both scores

Tests

  • reverts when the best candidate loses to the baseline: asserts the emitted reward is the seed's target-split score (not the regression) and the final eval is the target eval of the reverted commit
  • exact tie reverts (pins the <= boundary against a refactor to <)
  • candidate that beats the baseline is kept
  • floor off restores the old behavior (base never scored)
  • floor on with no base_commit silently no-ops (never evaluates commit=None)

Stacked on #27 (harbor-2-baseline-durable). An adversarial review panel (4 lenses + verification) ran pre-open; all confirmed findings addressed.

🤖 Generated with Claude Code

Greptile Summary

This PR adds a selection floor to auto_best mode: after admin re-scoring the candidate shortlist, the untouched base_commit is also admin-scored on the selection split, and the winner is reverted to the seed if it does not strictly beat it. The fix closes a real production path where every candidate regressed but the least-bad one was still shipped.

  • verifier.py: New auto_best_baseline_floor flag (default True); in _best_from_db, after picking the best candidate, the base commit is admin-evaluated on the selection split and a <= comparison triggers a revert to base_commit. A None base_commit silently no-ops.
  • serve.py: ServeConfig exposes the new flag and wires it through build_components.
  • test_harbor_verifier.py: Five new tests cover revert-on-regression, exact-tie revert, floor no-op without base_commit, candidate-beats-baseline keep, and floor-off backward compatibility; existing tests updated to set auto_best_baseline_floor=False where they isolate unrelated ranking behavior.

Confidence Score: 4/5

The floor logic is sound and closes a real regression path; the one edge case (cross-dataset floor comparison when selection_dataset_id is unconfigured and the admin re-scoring reorders the shortlist) is unlikely to fire in well-configured deployments and has no effect when selection_dataset_id is set.

The core revert logic, the <= tie-break, and the base_commit=None no-op are all correct and well-tested. The dataset_id used for the base floor eval is borrowed from the top recorded-score shortlist entry rather than from the actual winner's row, which can make the comparison cross-dataset in the unconstrained-dataset scenario.

The dataset_id fallback in the floor block of vero/src/vero/harbor/verifier.py (lines 304-306) deserves a second look if selection_dataset_id is ever left unset in production configs.

Important Files Changed

Filename Overview
vero/src/vero/harbor/verifier.py Adds the baseline floor in _best_from_db; one edge case in the base_dataset_id fallback can produce a cross-dataset floor comparison when selection_dataset_id is None and the admin re-scoring promotes a non-first shortlist entry.
vero/src/vero/harbor/serve.py Adds auto_best_baseline_floor: bool = True to ServeConfig and passes it through to Verifier; straightforward wiring, no issues.
vero/tests/test_harbor_verifier.py Five well-structured new tests cover the key floor boundaries (revert, tie, no-op, keep, floor-off); existing tests correctly updated with auto_best_baseline_floor=False to isolate their original concerns.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant F as finalize()
    participant B as _best_from_db()
    participant E as EvaluationEngine

    F->>B: _best_from_db()
    B->>E: get_experiments_df()
    E-->>B: df
    note over B: filter to selection_split,<br/>exclude base_commit,<br/>shortlist top-K by recorded score
    loop for each shortlisted candidate
        B->>E: "evaluate_admin(commit=candidate, split=selection_split)"
        E-->>B: admin_score
    end
    note over B: sort by admin_score -> best_commit

    alt "auto_best_baseline_floor=True AND base_commit set"
        B->>E: "evaluate_admin(commit=base_commit, split=selection_split)"
        E-->>B: base_score
        alt "best_score <= base_score"
            B-->>F: return base_commit (revert to seed)
        else "best_score > base_score"
            B-->>F: return best_commit
        end
    else floor disabled or no base_commit
        B-->>F: return best_commit
    end

    loop for each VerificationTarget
        F->>E: "evaluate_admin(commit=sha, split=target.split)"
        E-->>F: target score
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant F as finalize()
    participant B as _best_from_db()
    participant E as EvaluationEngine

    F->>B: _best_from_db()
    B->>E: get_experiments_df()
    E-->>B: df
    note over B: filter to selection_split,<br/>exclude base_commit,<br/>shortlist top-K by recorded score
    loop for each shortlisted candidate
        B->>E: "evaluate_admin(commit=candidate, split=selection_split)"
        E-->>B: admin_score
    end
    note over B: sort by admin_score -> best_commit

    alt "auto_best_baseline_floor=True AND base_commit set"
        B->>E: "evaluate_admin(commit=base_commit, split=selection_split)"
        E-->>B: base_score
        alt "best_score <= base_score"
            B-->>F: return base_commit (revert to seed)
        else "best_score > base_score"
            B-->>F: return best_commit
        end
    else floor disabled or no base_commit
        B-->>F: return best_commit
    end

    loop for each VerificationTarget
        F->>E: "evaluate_admin(commit=sha, split=target.split)"
        E-->>F: target score
    end
Loading

Comments Outside Diff (1)

  1. vero/src/vero/harbor/verifier.py, line 270-306 (link)

    P2 When selection_dataset_id is None, base_dataset_id falls back to shortlist.iloc[0].get("dataset_subset_dataset_id") — the dataset of the highest recorded-score candidate. But the admin re-scoring can promote a different shortlist entry as best_commit, and each candidate is evaluated on its own row's dataset_subset_dataset_id. When those two entries differ, the floor compares best_score (scored on the winner's dataset) against base_score (scored on the first shortlist entry's dataset) — a cross-dataset comparison that can falsely keep or falsely revert. Storing the dataset_id in the rescored tuple and using the winner's own dataset_id for the base eval keeps the comparison apples-to-apples.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: vero/src/vero/harbor/verifier.py
    Line: 270-306
    
    Comment:
    When `selection_dataset_id` is `None`, `base_dataset_id` falls back to `shortlist.iloc[0].get("dataset_subset_dataset_id")` — the dataset of the highest *recorded-score* candidate. But the admin re-scoring can promote a different shortlist entry as `best_commit`, and each candidate is evaluated on its own row's `dataset_subset_dataset_id`. When those two entries differ, the floor compares `best_score` (scored on the winner's dataset) against `base_score` (scored on the first shortlist entry's dataset) — a cross-dataset comparison that can falsely keep or falsely revert. Storing the dataset_id in the `rescored` tuple and using the winner's own dataset_id for the base eval keeps the comparison apples-to-apples.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/harbor/verifier.py:270-306
When `selection_dataset_id` is `None`, `base_dataset_id` falls back to `shortlist.iloc[0].get("dataset_subset_dataset_id")` — the dataset of the highest *recorded-score* candidate. But the admin re-scoring can promote a different shortlist entry as `best_commit`, and each candidate is evaluated on its own row's `dataset_subset_dataset_id`. When those two entries differ, the floor compares `best_score` (scored on the winner's dataset) against `base_score` (scored on the first shortlist entry's dataset) — a cross-dataset comparison that can falsely keep or falsely revert. Storing the dataset_id in the `rescored` tuple and using the winner's own dataset_id for the base eval keeps the comparison apples-to-apples.

```suggestion
        rescored: list[tuple[float, int, str, str | None]] = []
        for idx, (_, row) in enumerate(shortlist.iterrows()):
            commit = row["candidate_commit"]
            dataset_id = row.get("dataset_subset_dataset_id")
            exp = await self.engine.evaluate_admin(
                task=self.selection_task,
                dataset_id=dataset_id,
                split=self.selection_split,
                commit=commit,
            )
            score = exp.result.score()
            admin_score = float(score) if score is not None else default_minimum_score
            # Tie-break by shortlist position (already ordered by recorded score
            # then recency), so ties resolve deterministically without depending on
            # the type of candidate_created_at (a datetime in the real DB).
            rescored.append((admin_score, idx, commit, dataset_id))
            logger.info(
                "auto_best re-score: commit=%s admin_score=%s (recorded=%s)",
                commit,
                admin_score,
                row["mean_score"],
            )
        # Highest admin score wins; ties break to the earliest shortlist position.
        rescored.sort(key=lambda t: (-t[0], t[1]))
        best_score, _, best_commit, best_dataset_id = rescored[0]

        # Selection floor: never ship a candidate that fails to beat the untouched
        # baseline on the selection split. auto_best excludes base_commit from the
        # candidate pool, so without this it selects the least-bad candidate even
        # when every candidate regressed. Revert to the seed instead. Strict '>' so
        # a statistical tie also reverts: if the optimizer cannot show an
        # improvement, shipping the seed is the safe outcome. Needs a base_commit to
        # compare against; costs one extra admin eval on the selection split.
        if self.auto_best_baseline_floor and self.base_commit is not None:
            base_dataset_id = self.selection_dataset_id or best_dataset_id
```

Reviews (1): Last reviewed commit: "fix(harbor): auto_best reverts to the ba..." | Re-trigger Greptile

…s it

auto_best excludes base_commit from the candidate pool, so when every candidate
regressed it still selected the least-bad one and shipped a regression (observed
live: an opus optimizer on a weak haiku inner model produced only below-baseline
candidates and finalize shipped one 0.10 below the baseline, even though the free
baseline reference was available). Visibility alone did not prevent the harm;
nothing acted on it.

Add a selection floor: after the admin re-score picks the best candidate, admin-
score the untouched base_commit on the selection split and revert to it when the
best candidate does not strictly beat it (a statistical tie reverts too: if the
optimizer cannot show an improvement, shipping the seed is the safe outcome). On
by default, gated on a base_commit being set; costs one extra admin eval.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant