fix(harbor): make baseline scoring at finalize durable and retried by shehabyasser-scale · Pull Request #27 · scaleapi/vero

shehabyasser-scale · 2026-07-04T15:38:33Z

Problem (observed live, then root-caused)

A trial silently skipped score_baseline at finalize: reward.json landed ~2 minutes after the winner's validation eval with no baseline.json and no baseline ledger row, while a byte-identical config produced one in another run. A three-agent investigation traced the mechanism: the nested baseline eval failed transiently within seconds, _maybe_score_baseline's warn-and-continue except swallowed it, and the only record (a logger.exception to sidecar stderr) was destroyed with the container at teardown. The admin volume holding baseline.json does not survive teardown either. Two subsequent identical runs scored the baseline fine, confirming the failure is intermittent and environmental, not config- or state-dependent.

Fix

- Retry: the baseline eval gets baseline_score_attempts total attempts (default 2, wired through ServeConfig), so a single transient nested-run failure no longer drops the regression check.
- Durable outcome: finalize() now returns {"rewards": ..., "baseline": ...} where baseline is {"scores": ...}, {"skipped": reason} or {"error": ..., "error_type": ..., "attempts": N}. The CLI writes only rewards to reward.json (the outer harness contract is unchanged) and echoes the full payload to stdout, which harbor captures into the trial's test-stdout.txt on the host: the one channel that survives teardown. The CLI tolerates the old bare-rewards response shape for a mixed-version sidecar.
- baseline.json is still written to the admin volume (best-effort) for live in-cluster debugging.
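The retry and the structured outcome described above can be sketched roughly as follows. This is a minimal illustration, not the code in verifier.py: `score_baseline_with_retry`, `finalize_sketch`, the `evaluate` callable and the skip reason string are hypothetical names chosen for the example; only the response keys (`rewards`, `baseline`, `scores`, `skipped`, `error`, `error_type`, `attempts`) come from the description.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("baseline_sketch")


def score_baseline_with_retry(
    evaluate: Callable[[], Dict[str, float]],
    attempts: int = 2,
) -> Dict[str, Any]:
    """Run `evaluate` up to `attempts` times and report the outcome as a dict.

    Never raises: a failing baseline must not fail the trial.
    """
    if attempts < 1:
        # Hypothetical skip reason, used only to show the {"skipped": ...} shape.
        return {"skipped": "no baseline attempts configured"}

    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            scores = evaluate()
        except Exception as exc:  # broad on purpose: any nested-run failure is retried
            last_error = exc
            logger.warning(
                "baseline eval attempt %d/%d failed: %r", attempt, attempts, exc
            )
            continue
        return {"scores": scores, "attempts": attempt}

    # Every attempt failed: return the error instead of leaving it only in a log.
    return {
        "error": str(last_error),
        "error_type": type(last_error).__name__,
        "attempts": attempts,
    }


def finalize_sketch(
    rewards: Dict[str, float],
    evaluate_baseline: Callable[[], Dict[str, float]],
    attempts: int = 2,
) -> Dict[str, Any]:
    """Wrap the trial rewards and the baseline outcome in one response."""
    return {
        "rewards": rewards,
        "baseline": score_baseline_with_retry(evaluate_baseline, attempts),
    }
```

Because the error is returned as data rather than raised, the rewards in the response are the same whether the baseline succeeded, was skipped, or failed on every attempt.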

Baseline scoring still never fails the trial.

Tests

- transient failure then success: recovers on retry, reports attempts: 2
- persistent failure: retries, reports the error in the response, trial reward unaffected, nothing persisted
- CLI: wrapper response writes only rewards to reward.json and echoes the baseline to stdout; bare-rewards back-compat kept
- app-layer /finalize mock updated to the real wrapper shape
- existing finalize tests updated to the wrapper (serve integration suite passes)

Stacked on #25 (harbor-2-free-baseline-eval). An adversarial review panel (4 lenses + verification) ran pre-open; all confirmed findings addressed.

🤖 Generated with Claude Code

Greptile Summary

This PR makes the baseline-scoring step at finalize both retryable and durably observable: instead of swallowing a transient failure into a log line that dies with the container, finalize() now returns {"rewards": {...}, "baseline": {...}} where baseline holds either the scores, a skip reason, or a structured error after all retry attempts are exhausted.

- Retry: _maybe_score_baseline loops up to baseline_score_attempts (default 2) times; a single transient nested-run failure no longer silently drops the regression check.
- Durable outcome: the CLI writes only rewards to reward.json (outer harness contract unchanged) and echoes the full response, including the baseline outcome, to stdout, which Harbor captures into test-stdout.txt on the host, the one channel that survives sidecar teardown.
- Back-compat: the CLI tolerates the older bare-rewards response shape so a mixed-version sidecar still produces a valid reward.json.
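The CLI side of this contract might look like the sketch below. It is an assumption-laden illustration rather than the code in cli.py: `extract_rewards` and `write_outputs` are hypothetical helpers, and detecting the wrapper by checking whether the `rewards` key holds a dict is one possible heuristic, not necessarily the guard the PR uses.

```python
import json
import sys
from pathlib import Path
from typing import Any, Dict, TextIO


def extract_rewards(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Return the rewards mapping from either response shape."""
    # New wrapper shape: {"rewards": {...}, "baseline": {...}}.
    if isinstance(resp.get("rewards"), dict):
        return resp["rewards"]
    # Old bare shape from an older sidecar: the response is the rewards mapping.
    return resp


def write_outputs(
    resp: Dict[str, Any],
    output_path: str,
    stdout: TextIO = sys.stdout,
) -> Dict[str, Any]:
    """Write only the rewards to the output file; echo the full payload to stdout."""
    rewards = extract_rewards(resp)
    Path(output_path).write_text(json.dumps(rewards))
    # The full response, including the baseline outcome, goes to stdout so the
    # host-side capture keeps a record after the sidecar is torn down.
    print(json.dumps(resp), file=stdout)
    return rewards
```

Keeping reward.json limited to the rewards means the outer harness sees the same keys as before, while the baseline outcome travels only on stdout.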

Confidence Score: 4/5

Safe to merge once the logging issue in verifier.py is addressed; reward.json output and the retry semantics are correct.

The final 'all attempts failed' log message calls logger.exception(..., exc_info=last_error) outside any except block, passing an exception instance rather than a 3-tuple. Python 3.11 (the project's minimum version per pyproject.toml) only accepts tuples or True for exc_info — a plain exception object causes the logging module to fall back to sys.exc_info(), which returns (None, None, None) outside an exception context, silently discarding the traceback. Improved traceback observability at failure was one of the stated goals of this PR. Everything else — the retry loop, the response wrapper, the CLI back-compat guard, and the new tests — is correct and well-tested.

vero/src/vero/harbor/verifier.py — the logger.exception call at the end of _maybe_score_baseline

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/src/vero/harbor/verifier.py | Core change: _maybe_score_baseline now retries up to _baseline_score_attempts times and returns a structured dict instead of None; finalize wraps result in {rewards, baseline}. One issue: logger.exception(..., exc_info=last_error) outside an except block silently drops the traceback on Python 3.11 (the minimum supported version). |
| vero/src/vero/harbor/cli.py | CLI now extracts resp["rewards"] for reward.json and echoes the full response (including baseline outcome) to stdout; backward-compat guard for bare-rewards shape is correct. |
| vero/src/vero/harbor/serve.py | Adds baseline_score_attempts: int = 2 to ServeConfig and threads it through to Verifier. Clean and minimal change. |
| vero/tests/test_harbor_verifier.py | Tests updated to the new wrapper shape; adds two new cases: persistent failure (retries, reports error, trial unaffected) and transient failure recovery. Coverage is thorough for the happy path and the two key failure modes. |
| vero/tests/test_harbor_cli.py | New test covers the wrapper response: verifies that reward.json contains only the rewards and that the baseline outcome appears in stdout. Back-compat test preserved. |
| vero/tests/test_harbor_app.py | Mock updated to reflect the real finalize wrapper shape; assertion updated accordingly. Straightforward. |
| vero/tests/test_harbor_serve.py | Integration tests updated to unwrap ["rewards"] from the finalize result. No logic changes to the assertions themselves. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant H as Harbor (host)
    participant CLI as vero harbor finalize (CLI)
    participant App as /finalize endpoint (FastAPI)
    participant V as Verifier
    participant E as EvaluationEngine

    H->>CLI: invoke finalize --token-file --output
    CLI->>App: POST /finalize (Bearer token)
    App->>V: finalize()
    V->>E: _select_commit()
    E-->>V: sha
    loop for each target
        V->>E: "evaluate_admin(commit=sha)"
        E-->>V: score
    end
    V->>V: _maybe_score_baseline(rewards)
    loop attempt 1..N (baseline_score_attempts)
        V->>E: "evaluate_admin(commit=base_commit)"
        alt success
            E-->>V: baseline score
            V-->>V: "return {scores, attempts}"
        else transient failure
            E-->>V: Exception
            V->>V: log warning, retry
        end
    end
    alt all attempts failed
        V-->>V: "return {error, error_type, attempts}"
    end
    V-->>App: "{rewards, baseline}"
    App-->>CLI: HTTP 200, JSON payload
    CLI->>CLI: extract rewards (or bare resp for back-compat)
    CLI->>CLI: write rewards to reward.json
    CLI->>H: echo full resp to stdout (captured into test-stdout.txt)

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/harbor/verifier.py:195-199
`logger.exception(..., exc_info=last_error)` is called outside an `except` block with an exception instance as `exc_info`. In Python 3.11 (the minimum required by `pyproject.toml`), the `logging.LogRecord` constructor treats any non-tuple `exc_info` value by falling through to `sys.exc_info()`, which returns `(None, None, None)` when there is no active exception context — so the traceback of `last_error` is silently dropped from the final log message. Exception-instance support for `exc_info` was only added in Python 3.12. Passing an explicit 3-tuple makes this work on all supported versions.

```suggestion
        logger.error(
            "baseline scoring failed after %d attempt(s); reward.json is unaffected",
            self._baseline_score_attempts,
            exc_info=(type(last_error), last_error, last_error.__traceback__),
        )
```

Reviews (1): Last reviewed commit: "fix(harbor): make baseline scoring at fi..."

Greptile also left 1 inline comment on this PR.

A live trial silently skipped baseline scoring: the nested baseline eval failed transiently, the failure was swallowed by a warn-and-continue except, and the only record (a log line) died with the sidecar container at teardown. The admin volume that baseline.json was written to does not survive teardown either, so there was no durable evidence of whether the baseline was scored, skipped, or crashed.

Two changes:

- Retry the baseline eval (default 2 attempts) so a single transient nested-run failure does not drop the regression check.
- finalize() now returns {"rewards": ..., "baseline": ...}; the baseline outcome (scores, a skip reason, or an error) is surfaced in the finalize response. The CLI writes only rewards to reward.json (the outer harness consumes its keys, unchanged) and echoes the full payload to stdout, which is captured into the trial's stdout on the host, the one channel that survives teardown. The CLI tolerates the old bare-rewards shape for a mixed-version sidecar.

Baseline scoring still never fails the trial.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-07-04T15:43:06Z

+        logger.exception(
+            "baseline scoring failed after %d attempt(s); reward.json is unaffected",
+            self._baseline_score_attempts,
+            exc_info=last_error,
+        )


logger.exception(..., exc_info=last_error) is called outside an except block with an exception instance as exc_info. In Python 3.11 (the minimum required by pyproject.toml), the logging.LogRecord constructor treats any non-tuple exc_info value by falling through to sys.exc_info(), which returns (None, None, None) when there is no active exception context — so the traceback of last_error is silently dropped from the final log message. Exception-instance support for exc_info was only added in Python 3.12. Passing an explicit 3-tuple makes this work on all supported versions.

Suggested change

-        logger.exception(
-            "baseline scoring failed after %d attempt(s); reward.json is unaffected",
-            self._baseline_score_attempts,
-            exc_info=last_error,
-        )
+        logger.error(
+            "baseline scoring failed after %d attempt(s); reward.json is unaffected",
+            self._baseline_score_attempts,
+            exc_info=(type(last_error), last_error, last_error.__traceback__),
+        )
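The suggested pattern, logging a saved exception after its except block has exited, can be tried in isolation with the sketch below. It uses only the standard logging module; the logger name, the `demo` function and the error message are hypothetical. Passing the explicit `(type, value, traceback)` tuple means the call does not depend on how a given Python version treats an exception instance passed as exc_info.

```python
import io
import logging


def demo() -> str:
    """Log a saved exception outside its except block and return the log text."""
    stream = io.StringIO()
    logger = logging.getLogger("exc_info_sketch")
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)

    last_error = None
    try:
        raise RuntimeError("nested eval failed")
    except RuntimeError as exc:
        # The name `exc` is cleared when the except block ends, so keep a
        # separate reference to the exception object.
        last_error = exc

    # No exception is being handled here, so the traceback has to be supplied
    # explicitly instead of being read from the current exception context.
    logger.error(
        "baseline scoring failed after %d attempt(s)",
        2,
        exc_info=(type(last_error), last_error, last_error.__traceback__),
    )

    logger.removeHandler(handler)
    return stream.getvalue()
```

The returned text holds the formatted message followed by the traceback of the saved exception, which is the part that needs to survive in the final failure log.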


shehabyasser-scale mentioned this pull request Jul 4, 2026

fix(harbor): auto_best reverts to the baseline when no candidate beats it #28

Open

greptile-apps Bot reviewed Jul 4, 2026

shehabyasser-scale wants to merge 1 commit into harbor-2-free-baseline-eval from harbor-2-baseline-durable
