Skip to content

fix(harbor): make baseline scoring at finalize durable and retried#27

Open
shehabyasser-scale wants to merge 1 commit into
harbor-2-free-baseline-evalfrom
harbor-2-baseline-durable
Open

fix(harbor): make baseline scoring at finalize durable and retried#27
shehabyasser-scale wants to merge 1 commit into
harbor-2-free-baseline-evalfrom
harbor-2-baseline-durable

Conversation

@shehabyasser-scale

@shehabyasser-scale shehabyasser-scale commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator

Problem (observed live, then root-caused)

A trial silently skipped score_baseline at finalize: reward.json landed ~2 minutes after the winner's validation eval with no baseline.json and no baseline ledger row, while a byte-identical config produced one in another run. A three-agent investigation traced the mechanism: the nested baseline eval failed transiently within seconds, _maybe_score_baseline's warn-and-continue except swallowed it, and the only record (a logger.exception to sidecar stderr) was destroyed with the container at teardown. The admin volume holding baseline.json does not survive teardown either. Two subsequent identical runs scored the baseline fine, confirming the failure is intermittent and environmental, not config- or state-dependent.

Fix

  • Retry: the baseline eval gets baseline_score_attempts total attempts (default 2, wired through ServeConfig), so a single transient nested-run failure no longer drops the regression check.
  • Durable outcome: finalize() now returns {"rewards": ..., "baseline": ...} where baseline is {"scores": ...}, {"skipped": reason} or {"error": ..., "error_type": ..., "attempts": N}. The CLI writes only rewards to reward.json (the outer harness contract is unchanged) and echoes the full payload to stdout, which harbor captures into the trial's test-stdout.txt on the host: the one channel that survives teardown. The CLI tolerates the old bare-rewards response shape for a mixed-version sidecar.
  • baseline.json is still written to the admin volume (best-effort) for live in-cluster debugging.

Baseline scoring still never fails the trial.

Tests

  • transient failure then success: recovers on retry, reports attempts: 2
  • persistent failure: retries, reports the error in the response, trial reward unaffected, nothing persisted
  • CLI: wrapper response writes only rewards to reward.json and echoes the baseline to stdout; bare-rewards back-compat kept
  • app-layer /finalize mock updated to the real wrapper shape
  • existing finalize tests updated to the wrapper (serve integration suite passes)

Stacked on #25 (harbor-2-free-baseline-eval). An adversarial review panel (4 lenses + verification) ran pre-open; all confirmed findings addressed.

🤖 Generated with Claude Code

Greptile Summary

This PR makes the baseline-scoring step at finalize both retryable and durably observable: instead of swallowing a transient failure into a log line that dies with the container, finalize() now returns {"rewards": {...}, "baseline": {...}} where baseline holds either the scores, a skip reason, or a structured error after all retry attempts are exhausted.

  • Retry: _maybe_score_baseline loops up to baseline_score_attempts (default 2) times; a single transient nested-run failure no longer silently drops the regression check.
  • Durable outcome: the CLI writes only rewards to reward.json (outer harness contract unchanged) and echoes the full response — including the baseline outcome — to stdout, which Harbor captures into test-stdout.txt on the host, the one channel that survives sidecar teardown.
  • Back-compat: the CLI tolerates the older bare-rewards response shape so a mixed-version sidecar still produces a valid reward.json.

Confidence Score: 4/5

Safe to merge once the logging issue in verifier.py is addressed; reward.json output and the retry semantics are correct.

The final 'all attempts failed' log message calls logger.exception(..., exc_info=last_error) outside any except block, passing an exception instance rather than a 3-tuple. Python 3.11 (the project's minimum version per pyproject.toml) only accepts tuples or True for exc_info — a plain exception object causes the logging module to fall back to sys.exc_info(), which returns (None, None, None) outside an exception context, silently discarding the traceback. Improved traceback observability at failure was one of the stated goals of this PR. Everything else — the retry loop, the response wrapper, the CLI back-compat guard, and the new tests — is correct and well-tested.

vero/src/vero/harbor/verifier.py — the logger.exception call at the end of _maybe_score_baseline

Important Files Changed

Filename Overview
vero/src/vero/harbor/verifier.py Core change: _maybe_score_baseline now retries up to _baseline_score_attempts times and returns a structured dict instead of None; finalize wraps result in {rewards, baseline}. One issue: logger.exception(..., exc_info=last_error) outside an except block silently drops the traceback on Python 3.11 (the minimum supported version).
vero/src/vero/harbor/cli.py CLI now extracts resp["rewards"] for reward.json and echoes the full response (including baseline outcome) to stdout; backward-compat guard for bare-rewards shape is correct.
vero/src/vero/harbor/serve.py Adds baseline_score_attempts: int = 2 to ServeConfig and threads it through to Verifier. Clean and minimal change.
vero/tests/test_harbor_verifier.py Tests updated to the new wrapper shape; adds two new cases: persistent failure (retries, reports error, trial unaffected) and transient failure recovery. Coverage is thorough for the happy path and the two key failure modes.
vero/tests/test_harbor_cli.py New test covers the wrapper response: verifies that reward.json contains only the rewards and that the baseline outcome appears in stdout. Back-compat test preserved.
vero/tests/test_harbor_app.py Mock updated to reflect the real finalize wrapper shape; assertion updated accordingly. Straightforward.
vero/tests/test_harbor_serve.py Integration tests updated to unwrap ["rewards"] from the finalize result. No logic changes to the assertions themselves.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant H as Harbor (host)
    participant CLI as vero harbor finalize (CLI)
    participant App as /finalize endpoint (FastAPI)
    participant V as Verifier
    participant E as EvaluationEngine

    H->>CLI: invoke finalize --token-file --output
    CLI->>App: POST /finalize (Bearer token)
    App->>V: finalize()
    V->>E: _select_commit()
    E-->>V: sha
    loop for each target
        V->>E: "evaluate_admin(commit=sha)"
        E-->>V: score
    end
    V->>V: _maybe_score_baseline(rewards)
    loop attempt 1..N (baseline_score_attempts)
        V->>E: "evaluate_admin(commit=base_commit)"
        alt success
            E-->>V: baseline score
            V-->>V: "return {scores, attempts}"
        else transient failure
            E-->>V: Exception
            V->>V: log warning, retry
        end
    end
    alt all attempts failed
        V-->>V: "return {error, error_type, attempts}"
    end
    V-->>App: "{rewards, baseline}"
    App-->>CLI: HTTP 200, JSON payload
    CLI->>CLI: extract rewards (or bare resp for back-compat)
    CLI->>CLI: write rewards to reward.json
    CLI->>H: echo full resp to stdout (captured into test-stdout.txt)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant H as Harbor (host)
    participant CLI as vero harbor finalize (CLI)
    participant App as /finalize endpoint (FastAPI)
    participant V as Verifier
    participant E as EvaluationEngine

    H->>CLI: invoke finalize --token-file --output
    CLI->>App: POST /finalize (Bearer token)
    App->>V: finalize()
    V->>E: _select_commit()
    E-->>V: sha
    loop for each target
        V->>E: "evaluate_admin(commit=sha)"
        E-->>V: score
    end
    V->>V: _maybe_score_baseline(rewards)
    loop attempt 1..N (baseline_score_attempts)
        V->>E: "evaluate_admin(commit=base_commit)"
        alt success
            E-->>V: baseline score
            V-->>V: "return {scores, attempts}"
        else transient failure
            E-->>V: Exception
            V->>V: log warning, retry
        end
    end
    alt all attempts failed
        V-->>V: "return {error, error_type, attempts}"
    end
    V-->>App: "{rewards, baseline}"
    App-->>CLI: HTTP 200, JSON payload
    CLI->>CLI: extract rewards (or bare resp for back-compat)
    CLI->>CLI: write rewards to reward.json
    CLI->>H: echo full resp to stdout (captured into test-stdout.txt)
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
vero/src/vero/harbor/verifier.py:195-199
`logger.exception(..., exc_info=last_error)` is called outside an `except` block with an exception instance as `exc_info`. In Python 3.11 (the minimum required by `pyproject.toml`), the `logging.LogRecord` constructor treats any non-tuple `exc_info` value by falling through to `sys.exc_info()`, which returns `(None, None, None)` when there is no active exception context — so the traceback of `last_error` is silently dropped from the final log message. Exception-instance support for `exc_info` was only added in Python 3.12. Passing an explicit 3-tuple makes this work on all supported versions.

```suggestion
        logger.error(
            "baseline scoring failed after %d attempt(s); reward.json is unaffected",
            self._baseline_score_attempts,
            exc_info=(type(last_error), last_error, last_error.__traceback__),
        )
```

Reviews (1): Last reviewed commit: "fix(harbor): make baseline scoring at fi..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

A live trial silently skipped baseline scoring: the nested baseline eval failed
transiently, the failure was swallowed by a warn-and-continue except, and the only
record (a log line) died with the sidecar container at teardown. The admin volume
that baseline.json was written to does not survive teardown either, so there was no
durable evidence of whether the baseline was scored, skipped, or crashed.

Two changes:
- Retry the baseline eval (default 2 attempts) so a single transient nested-run
  failure does not drop the regression check.
- finalize() now returns {"rewards": ..., "baseline": ...}; the baseline outcome
  (scores, a skip reason, or an error) is surfaced in the finalize response. The CLI
  writes only rewards to reward.json (the outer harness consumes its keys, unchanged)
  and echoes the full payload to stdout, which is captured into the trial's stdout on
  the host, the one channel that survives teardown. The CLI tolerates the old
  bare-rewards shape for a mixed-version sidecar.

Baseline scoring still never fails the trial.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +195 to +199
logger.exception(
"baseline scoring failed after %d attempt(s); reward.json is unaffected",
self._baseline_score_attempts,
exc_info=last_error,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 logger.exception(..., exc_info=last_error) is called outside an except block with an exception instance as exc_info. In Python 3.11 (the minimum required by pyproject.toml), the logging.LogRecord constructor treats any non-tuple exc_info value by falling through to sys.exc_info(), which returns (None, None, None) when there is no active exception context — so the traceback of last_error is silently dropped from the final log message. Exception-instance support for exc_info was only added in Python 3.12. Passing an explicit 3-tuple makes this work on all supported versions.

Suggested change
logger.exception(
"baseline scoring failed after %d attempt(s); reward.json is unaffected",
self._baseline_score_attempts,
exc_info=last_error,
)
logger.error(
"baseline scoring failed after %d attempt(s); reward.json is unaffected",
self._baseline_score_attempts,
exc_info=(type(last_error), last_error, last_error.__traceback__),
)
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 195-199

Comment:
`logger.exception(..., exc_info=last_error)` is called outside an `except` block with an exception instance as `exc_info`. In Python 3.11 (the minimum required by `pyproject.toml`), the `logging.LogRecord` constructor treats any non-tuple `exc_info` value by falling through to `sys.exc_info()`, which returns `(None, None, None)` when there is no active exception context — so the traceback of `last_error` is silently dropped from the final log message. Exception-instance support for `exc_info` was only added in Python 3.12. Passing an explicit 3-tuple makes this work on all supported versions.

```suggestion
        logger.error(
            "baseline scoring failed after %d attempt(s); reward.json is unaffected",
            self._baseline_score_attempts,
            exc_info=(type(last_error), last_error, last_error.__traceback__),
        )
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant